Express system availability as a ratio of events

Uptime is a poor proxy for availability. Instead of measuring a system’s availability as the time it was online and able to serve requests, a better approach is to ask: “how many times did the system successfully do the thing it was designed to do?”

This means thinking in terms of events. For an e-commerce system, these events might be “add product to shopping cart” or “complete order payment”. You will probably want to track certain events separately, because they differ in importance.

While doing this, you should also classify events and disqualify any that would skew the data. For example, a user abandoning their shopping cart because they changed their mind probably shouldn’t be counted as a valid event if you’re trying to measure the availability of a check-out process. But a user unable to complete the process because your payment provider’s API is down should count.

Generalized, this gives the following formula:

$\text{Availability} = \frac{\text{successful events}}{\text{total events} - \text{disqualified events}} \times 100$
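As a rough illustration, the formula could be computed like this. This is only a sketch: the `Outcome` enum, the `availability` function, and the event counts are made up for the example, and it assumes each event has already been classified as successful, failed, or disqualified.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical classification of a single event."""
    SUCCESS = "success"
    FAILURE = "failure"
    DISQUALIFIED = "disqualified"  # e.g. the user abandoned the cart by choice


def availability(outcomes):
    """Return availability as a percentage, or None if no valid events exist."""
    total = len(outcomes)
    disqualified = sum(1 for o in outcomes if o is Outcome.DISQUALIFIED)
    successful = sum(1 for o in outcomes if o is Outcome.SUCCESS)
    valid = total - disqualified
    if valid == 0:
        # Nothing to measure; avoid dividing by zero.
        return None
    return successful / valid * 100


# Illustrative numbers only: 1000 check-out events, 50 of them disqualified.
checkout_events = (
    [Outcome.SUCCESS] * 940
    + [Outcome.FAILURE] * 10
    + [Outcome.DISQUALIFIED] * 50
)
checkout_availability = availability(checkout_events)
print(f"{checkout_availability:.2f}%")
```

Here the 50 disqualified events are removed from the denominator, so the result is 940 out of 950 valid events rather than 940 out of 1000.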

Including latency

The section above focuses on success and error rates, but the same method can also incorporate latency into the metric. Latency is one of the four golden signals of monitoring, and it is something that the concept of “uptime” cannot represent.

By including a latency target in your measurements (that is to say, counting events as “bad” when they take too long to complete) you can express both error rates and latency within the same measure.

This matters because very slow responses are just as likely to turn people away as high error rates are.
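A minimal sketch of folding a latency target into the same ratio might look like the following. The `Event` type, the function names, and the 500 ms target are assumptions chosen for illustration, and the sketch assumes disqualified events have already been filtered out.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical record of one request."""
    ok: bool           # did the request succeed?
    latency_ms: float  # how long it took to complete


def is_good(event, latency_target_ms):
    """An event is good only if it succeeded AND met the latency target."""
    return event.ok and event.latency_ms <= latency_target_ms


def availability_with_latency(events, latency_target_ms=500.0):
    """Return the percentage of good events, or None if there are no events."""
    if not events:
        return None
    good = sum(1 for e in events if is_good(e, latency_target_ms))
    return good / len(events) * 100


events = [
    Event(ok=True, latency_ms=120.0),
    Event(ok=True, latency_ms=480.0),
    Event(ok=True, latency_ms=2300.0),  # succeeded, but too slow: counts as bad
    Event(ok=False, latency_ms=90.0),   # fast, but failed: counts as bad
]
result = availability_with_latency(events)
print(result)  # 2 good events out of 4
```

With this approach a slow success and a fast failure both lower the same number, so a single figure reflects errors and latency together.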