Break Free of the Template: Incident Writeups They Want to Read
Date | Wednesday, 26 October, 2022 - 11:00–11:40 |
Presenter | Laura Nolan, Stanza |
URL | https://www.usenix.org/conference/srecon22emea/presentation/nolan-break |
Abstract
Most of us write incident reviews (IRs) or postmortems occasionally. Unfortunately, many IRs are never read by anyone other than those involved in the incident, and therefore have limited benefit to an organisation.
However, IRs that are well-crafted can create learning that will last in your organisation for years (and maybe even beyond). This talk will give practical advice on how to write the most engaging and valuable IR possible.
Notes
- Value of written post-incident reports (especially compared to verbal or async review):
  - Share knowledge and context, also across teams
  - Helps org understand and adapt
  - Encourages thoughtful reflection
  - Long-term store of knowledge
  - Onboarding
- Reference to the Verica Open Incident Database
  - Share knowledge industry wide
  - Can help take sharp edges off SaaS services/tools
- Value of PIRs is in the learning, not the process
- Examples of good incident write-ups:
  - https://hacks.mozilla.org/2022/02/retrospective-and-technical-details-on-the-recent-firefox-outage/
  - https://about.gitlab.com/blog/2019/11/08/the-consul-outage-that-never-happened/
  - https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres/
  - https://gocardless.com/blog/incident-review-service-outage-on-25-october-2020/
  - https://f.hubspotusercontent30.net/hubfs/5193039/Engineering%20Retrospectives/Do%20You%20Remember%20the%2020%20Fires%20of%20September%20-%20November%202021.pdf
- Any engineer should be able to read any IR, so:
  - Explain jargon/system names
  - Explain why things are the way they are
  - Weave (concise) explanations into the narrative, and/or link out to more detailed documentation if applicable
  - Be visual: use pictures, graphs, and diagrams (architecture, topology, sequence, etc.)
- Don’t skip the analysis
  - If an IR is a story, the analysis is the moral of the story
  - Analysis and lessons learned are satisfying to read
  - Personal note (not mentioned in the talk): they also teach others how to reason through similar events
- Craft
  - Think of good, descriptive titles that people will remember
  - Use simple and clear language (avoid jargon, use simple terms, proper structure, no walls of text, consistent tense, etc.); basically tech writing 101
  - Be neither too formal nor too informal
  - Avoid confusing or obscure cultural references or metaphors
Asked “Should you always do this?” and “Is it okay to use a template sometimes?”, Laura answered that you should (probably) reserve this effort for incidents which warrant it: the ones that are really gnarly or interesting to learn from.
Some incidents are just repeats or cut-and-dried events that aren’t worth that level of effort. It’s perfectly okay to use a shorter form or template for those and get them done quickly.
Other insightful comments from Laura (from Slack):
I do advocate for reserving this kind of intensive effort for the most interesting and impactful writeups. It absolutely does take time and energy. And that does pay more dividends in larger orgs where you have:
- more adjacent teams who might benefit from the context
- more new joiners who need to get up to speed on systems and org
We all have to make choices about where the best payoff for our time is, and context matters, absolutely.
I guess I see these kinds of writeups as a powerful form of organisational memory. I think most of us agree this has value, but then none of us have all the time we need to do all the things we want.