Oncall: An Equal Opportunity Waste of Time

Date Tuesday, 25 October, 2022 - 11:00–11:40
Presenter Dave O’Connor, Twilio
URL https://www.usenix.org/conference/srecon22emea/presentation/oconnor

Abstract

Live 24/7 support of production services is often completely ingrained into the model in stakeholders’ minds of what SRE does. It remains a huge part of the “value” of most SRE groups. This talk explores what might happen if it wasn’t. How do SRE demonstrate their value in a post-oncall world? How do we aim toward that place? Also, even if you don’t get to throw your pager in the sea tomorrow, how can you apply these principles anyway?

Notes

  • Background: Google SRE (7), Elastic SRE (2), Twilio SRE (1, current)
  • The reward for good work is more work
  • “If you’re still doing the same thing next year, you’re probably not moving fast enough” and will be rewarded with more work
  • “Main problem I have with on-call is that it’s considered to be special”
  • Gatekeeping on-call (requiring special training, etc.) works against simplifying/minimizing on-call.
  • “Devs should be on-call” is a fairly pervasive view in the industry. However, you can teach people about incidents and the impact of product decisions on incidents without subjecting them to the pain of on-call. (There are examples of organizations where on-call work is voluntary and doesn’t suck.)
  • SRE value should not follow from how many times SREs pull people out of the fire (by handling incidents).
  • Thought experiments:
    • Imagine there is no on-call at all. What would you do?
      • The answers that come up are almost always different from what is currently being done. This leads to the questions: is our team too big, then? Are we doing the wrong things?
    • Say dev and SRE should do equal amounts of on-call: what would that look like?
      • At this point, Twilio was given as an example, where incident command was handled by a dedicated team that would act as IC and ensure that post-incident reviews happen, follow-ups are done, etc.

A HangOps community conversation about the text version of this talk offers a lot of historical insight: 20221123-1230 - HangOps conversation discussing some of the origins of Google SRE