You put out the fire on production, what’s next?

September 28, 2023

Sh*t happens. Releases don't always go as planned, and production systems sometimes break. Whether it's a bug in your code, a library you use, or an infrastructure or hardware failure, the first thing you need to do is bring the system back to life with minimal damage.

But what happens next? Should we grab a coffee and carry on with our daily tasks?

Have you heard about a ritual called "Post Mortem Analysis"? It sounds scary, but it's incredibly valuable when done properly, and that's what I'd like to talk about: what it is, how to conduct such an analysis, who should and who shouldn't be involved, what to watch for along the way, and what its ultimate goals are. After all, you don't want to end up with the same production outage tomorrow, do you?