What is Site Reliability Engineering?
The term Site Reliability Engineering (SRE) originates from Google and describes the methodology and management practices Google uses to run its infrastructure. In the years since Google first started publishing and talking about its methods, SRE has been adopted by many software companies both large and small.
While the field of SRE is broad, Google’s original vision centers around two main concepts:
- Software Engineering in System Design: Systems should be well designed, and SREs should have an engineering background. Automation should be preferred over manual operations work (toil), because manual work does not scale.
  For this reason, SRE has sometimes been described as what happens when you let software engineers run infrastructure.
- Sufficient, but not too much reliability: Operations teams want stable systems, while product development teams want to ship new features. Because changes to a running system are inherently risky [1], these teams are fundamentally at odds with each other.
  SRE practice places a strong emphasis on defining suitable Service Level Objectives (SLOs) and Service Level Agreements (SLAs), which then translate into an error budget. The error budget stems from the observation that 100% is almost always the wrong reliability target: whatever the objective leaves short of 100% is the amount of unreliability the service is allowed to spend.
  As systems produce errors and become unavailable, they consume their error budget. When the error budget is exhausted, development and deployment of new features are paused in favor of reliability enhancements.
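The error budget arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not Google's tooling: the function names, the 30-day window, and the request-based accounting are all assumptions chosen for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO permits over a window.

    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo < 1.0:
        # A 100% target leaves no budget at all, so it is rejected here.
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns 1.0 when nothing has been spent and a negative value
    once the budget is overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_pause_feature_work(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical policy check: pause feature launches once the budget is gone."""
    return budget_remaining(slo, total_requests, failed_requests) <= 0.0
```

Under these assumptions, a 99.9% availability SLO over 30 days leaves roughly 43 minutes of allowed downtime. A service that has served one million requests with 250 failures has spent about a quarter of its budget, so `should_pause_feature_work(0.999, 1_000_000, 250)` is false, while 1,200 failures would overspend the budget and trigger the pause.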
[1] The chapter “Introduction” in Google’s first SRE book, section “Change Management”, states: “SRE has found that roughly 70% of outages are due to changes in a live system.”