Oncall: An Equal Opportunity Waste of Time

Date Tuesday, 25 October, 2022 - 11:00–11:40
Presenter Dave O’Connor, Twilio
URL https://www.usenix.org/conference/srecon22emea/presentation/oconnor


Live 24/7 support of production services is often completely ingrained into the model in stakeholders’ minds of what SRE does. It remains a huge part of the “value” of most SRE groups. This talk explores what might happen if it wasn’t. How do SRE demonstrate their value in a post-oncall world? How do we aim toward that place? Also, even if you don’t get to throw your pager in the sea tomorrow, how can you apply these principles anyway?


  • Background: Google SRE (7), Elastic SRE (2), Twilio SRE (1, current)
  • The reward for good work is more work
  • “If you’re still doing the same thing next year, you’re probably not moving fast enough” and will be rewarded with more work
  • “Main problem I have with on-call is that it’s considered to be special”
  • Gatekeeping on-call (requiring special training, etc.) works against simplifying/minimizing on-call.
  • “Devs should be on-call” is fairly pervasive in the industry. You can teach people about incidents & impact of product decisions on incidents without subjecting them to the pain of on-call (there are examples of organizations where on-call work is voluntary and doesn’t suck)
  • SRE value should not follow from how many times SREs pull people out of the fire (by handling incidents).
  • Thought experiments:
    • Imagine there is no on-call at all. What would you do?
      • The answers that come up are almost always different from what is currently being done. That leads to the questions: is our team too big, then? Are we doing the wrong things?
    • Say dev and SRE should do equal amounts of on-call: what would that look like?
      • At this point, the speaker gave the example of Twilio, where incident command was handled by a dedicated team that would act as IC and ensure that post-incident reviews happen, follow-ups are done, etc.

There’s a HangOps community conversation about the text version of this talk, which has a lot of historical insights: 20221123-1230 - HangOps conversation discussing some of the origins of Google SRE