Just Another Technology Guy: Post Mortems

In my previous post on Continuous Delivery I mentioned using postmortems as a way to share responsibility and create action items to address and prevent failures as part of a change. I thought it would be helpful if I took the time to explain how I like to run postmortem meetings.

To me, a good postmortem meeting has five distinct parts. Each part has a measurable goal that helps the group understand a failure and prevent it from recurring in the system. It's important to go through these parts in the order defined below.

Identify what happened

This may seem like common sense, but a good postmortem should always start with a chronological list of events. These events should focus on the WHAT and not the WHY or the HOW. The purpose of this part is to get everyone on the same page about the sequence of events that led up to the incident, happened during the incident, and led to its resolution.
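One way to keep this part disciplined is to capture the timeline as plain data before the meeting. The sketch below is only an illustration: the `TimelineEvent` type, the `build_timeline` helper, and the sample events are all made up for this example. It shows events recorded as timestamp plus observable fact, then sorted chronologically.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the incident timeline (hypothetical structure)."""
    at: datetime  # when the event happened
    what: str     # observable fact only: the WHAT, not the WHY or HOW


def build_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: e.at)


# Illustrative sample data, entered in the order people remembered it.
events = [
    TimelineEvent(datetime(2014, 5, 1, 14, 20), "Rollback completed"),
    TimelineEvent(datetime(2014, 5, 1, 14, 2), "Error-rate alert fired"),
    TimelineEvent(datetime(2014, 5, 1, 13, 55), "Deployment started"),
]

timeline = build_timeline(events)
for event in timeline:
    print(event.at.strftime("%H:%M"), event.what)
```

Keeping the `what` field to a single factual statement makes it harder for explanations or blame to creep into the timeline before the group gets to the later parts.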

Identify what went wrong during the incident

After the sequence of events has been identified and everyone understands the WHAT of the incident, it's important to understand what went wrong during that sequence of events. It's extremely important to keep this factual. If you can't back it up with data, don't include it. Again, that may sound like common sense, but it's very easy to get into a finger-pointing argument or start blaming other folks or groups during this part.

It's important to call out things like coordination failures, escalation issues, troubleshooting guide issues, or anything else that impeded resolution of the incident. The main goal of this part is to take a constructive look at how the existing processes worked and identify holes or gaps in them.

Identify what went well

After you've identified what went wrong during the incident, it's also very important to identify what went well. This is important from more than just a morale perspective. The group needs to reach consensus on what accelerated resolution of the incident so that those things can be repeated in future incidents. It's also a good way to recognize the efforts of team members or other product groups that put skin in the game during an incident.

During this part it's important to call out things like issues that were identified by an automated system or that were recognized early by some standard operating procedure. It's also important to call out good coordination between teams, any heroic individual or team efforts, as well as any new mitigation strategies that were created or discovered during the incident.

Bring up what issues have been identified as a result of the incident

The main goal of this part is to analyze the data presented in part one in light of what went well or poorly. During this part you want to clearly identify gaps that need to be filled as a result of this incident. This is not the time to bring up HOW to resolve the issues but simply to identify the issues.

During this part it's important to call out gaps in communication or process around incident status notifications. If holes have been found in the troubleshooting guide, this is the right time to bring them up. Other things, like team members having incorrect permissions or access problems, should come up during this part, as should lack of training or insufficient knowledge of third-party (or first-party) systems.

Identify actionable improvements

By this point you should have a list of things that can be improved. This list may include process improvements, documentation improvements, or software improvements. It's important during this phase to make those improvements actionable. The goal of this part is to walk out of the postmortem with not just a list of what actions need to be taken, but also who is going to take each action. Everyone should leave with a clear understanding of what changes need to be made and who is accountable for getting those changes implemented.
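A simple check before the meeting ends is to confirm that every improvement has a named owner. The following is a minimal sketch of that idea, not a prescribed tool: the `ActionItem` type, the `unowned` helper, and the sample items and names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    """One improvement coming out of the postmortem (hypothetical structure)."""
    description: str
    kind: str                    # "process", "documentation", or "software"
    owner: Optional[str] = None  # who is accountable; None means nobody yet


def unowned(items):
    """Return the action items that nobody has taken responsibility for."""
    return [item for item in items if not item.owner]


# Illustrative sample data.
items = [
    ActionItem("Add escalation contacts to the troubleshooting guide",
               "documentation", owner="alice"),
    ActionItem("Alert on elevated error rate after a deployment",
               "software", owner="bob"),
    ActionItem("Define who sends incident status notifications", "process"),
]

for item in unowned(items):
    print("Needs an owner:", item.description)
```

If the list of unowned items is not empty, the postmortem is not finished yet.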


Monday, May 5, 2014
