Too much reliability

When we think about the availability of IT services, it's easy to fall into the trap of wanting systems to be 100% reliable all the time. Errors are undesirable, and when systems are down, someone somewhere is having a bad experience. Obviously we want to prevent this from happening as much as possible.

But it's easy to forget that as we pursue higher and higher levels of reliability, the cost and investment needed to maintain that level of service also goes up, often exponentially. Setting the right target for reliability is really an exercise in trade-offs between risk acceptance and investment cost (and larger investments come with a greater opportunity cost).
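
To make the trade-off concrete, here is a minimal sketch (in Python, purely illustrative and not taken from any particular SRE tooling) that converts availability targets into a yearly downtime budget. Each additional "nine" cuts the allowed downtime by a factor of ten, while the engineering effort needed to stay within it typically grows much faster.

```python
# Illustrative only: translate availability targets ("nines") into
# the downtime they allow per year.

MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (0.99, 0.999, 0.9999, 0.99999):
    error_budget = 1 - availability              # fraction of time the service may be down
    downtime_minutes = error_budget * MINUTES_PER_YEAR
    print(f"{availability:.3%} target -> ~{downtime_minutes:,.0f} minutes of downtime per year")
```

Running this shows the budget shrinking from roughly 5,260 minutes a year at 99% to about 5 minutes a year at 99.999%, which is why each extra nine demands a disproportionately larger investment.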

The Google SRE book devotes an entire chapter to this topic, titled Embracing Risk, which opens as follows:

You might expect Google to try to build 100% reliable services—ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer. Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.

Site Reliability Engineering, Chapter 3 - Embracing Risk