Knowledge and Power: A Sociotechnical Systems Discussion on the Future of SRE

Date Tuesday, 25 October, 2022 - 09:00–09:45
Presenter Laura Maguire, Jeli, and Lorin Hochstein, Netflix
URL https://www.usenix.org/conference/srecon22emea/presentation/maguire

Abstract

This talk shares the findings from a series of exploratory discussions between two prominent Site Reliability Engineers from industry-leading organizations and two socio-technical systems researchers with extensive experience in distributed human-machine teaming.

Drawing from both academic and industry perspectives, this talk elaborates on topics relevant to socio-technical software systems - both practical considerations and philosophical concerns. The practical includes: the tradeoffs inherent in balancing operational load and work in support of feature delivery with the less-immediately-tangible - but no less important - work of learning about our systems; sharing knowledge as a team; and using that knowledge to reduce risk. Our philosophical inquiries relate to the impact of the history of SRE on its future, meditations on the ‘practice’ and values of SRE, and provocative, promising new directions.

Attendees will come away with new perspectives on how knowledge and power structures operate in their organizations, shaping the ways that we conduct and understand our work.

Notes

  • Knowledge is gained through practice. How do we share it? What happens to it when people leave?
  • In complex systems, everyone has only a partial, incomplete understanding of the system.
  • Many incident reports don’t explain how people investigated the incident. They don’t say where people looked, why they chose to investigate one thing and not another, etc.
  • “Systems are hard and messy” is intuitive for people doing the actual work, but difficult for management to grasp. How can we explain these things to management/leadership? Shallow metrics are tempting here, but they are not the answer.
  • Keep The Lights On (KTLO) tends not to be included in OKRs, which influences what people do and do not work on.
  • Abstractions (platforms) can reduce ops work for teams, but ops work is like a muscle that atrophies when not exercised. Lorin worries that there will be fewer incidents, but that when they do happen, they’ll be more complex and more difficult to investigate and respond to.
  • Compliance/regulators are always catching up. At best this is irritating for practitioners; at worst it is actively dangerous. As an industry, it’s better to be self-led than to wait for regulation to happen (our industry is guaranteed to face more regulation over time).