Exponential backoff

In networked and distributed systems, remote calls are known to fail from time to time for any number of reasons. Clients often deal with such Transient Failures by implementing a retry-strategy, where failed calls are simply retried until they either succeed, or error too many times and the system eventually gives up.

In a naive implementation where retries are simply performed in a fixed interval (say, each 1 second apart) this is likely to result in a Thundering Herd situation.

Exponential backoff is a well-known retry strategy in control theory and computer science which aims to make such retries more efficient for the system as a whole by doubling the time interval between consecutive retries.

The retry delay is usually capped at some maximum. For example, once the retry delay hits 1 minute, the next rety might again be 1 minute instead of doubling to 2 minutes. Formally, this is called truncated exponential backoff.

For example, if a network call fails, it is retried 1 second later. If it then fails again, the system waits 2 seconds before retrying, and so on.

HR assistantHR systemDatabaseExport employee recordsQuery recordsError: overloadedSleep 1 second (+jitter)Query recordsError: connection refusedSleep 2 seconds (+jitter)Query recordsError: system startingSleep 4 seconds (+jitter)Query recordsReturn resultsDownload exportHR assistantHR systemDatabase

Jitter

Robust, mature implementations also introduce a slight, randomized delay (called jitter) alongside each retry. This aims to spread out different clients so that they are not all retrying requests at the exact same moment, but gradually become equally distributed over time.

An good explanation of why this is important is available on the AWS Architecture blog in Exponential Backoff And Jitter.

Further reading