Postmortem

Written by Perry Lee

Sometimes, when a deployment goes out to production, things break. Besides fixing the problem and finding out who to blame, we have to ask, “What else could we have done better?”

I used to work on a team where people would find someone to blame when a bad release went to production. Upper managers would go nuts and say things like “You need to fix your s#!%. We can’t let this happen.” I didn’t blame them, but I also didn’t need someone to tell me that. If being a supervisor just means blasting out some bad words and hoping an Incredible Hulk roar will keep it from happening next time, I would have done a better job than they did. We as developers already understand how frustrating it is when our builds go into the wild and break production.

I learned at UserTesting that a postmortem is not about finding out who to blame. The most important thing is that we learn from our mistakes, share what we learned with everyone on the team, and figure out how we can mitigate these risks in the future. Here are some key parts of a successful postmortem:

What happened?

Why is this important? It tells your team and the other people involved what happened so they can understand the problem.

Root causes?

Why did the incident occur? Maybe the problem has nothing to do with the developer or the build. Maybe it is because of the infrastructure of a system. Or maybe it’s some other fundamental problem. Maybe… I say… maybe someone shouldn’t deploy after happy hour on Friday night.

What did we learn?

Knowing the problem is not enough. We also want to learn from it. George Santayana says, “Those who cannot remember the past are condemned to repeat it.” We don’t want to repeat our mistakes.

Preventative actions

Now that we have learned a lot from our mistakes, we can’t just sit here and do nothing. Actions are required to prevent future issues. It could be something like, “Everyone keep an eye on that one developer who wants to release something after one too many drinks at happy hour on Friday.” Whatever it is, communicate it to the team so everyone learns from the mistake without having to make it themselves.