Exponential backoff

In networked and distributed systems, remote calls are known to fail from time to time for any number of reasons. Clients often deal with such Transient Failures by implementing a retry-strategy, where failed calls are simply retried until they either succeed, or error too many times and the system eventually gives up.

In a naive implementation where retries are simply performed in a fixed interval (say, each 1 second apart) this is likely to result in a Thundering Herd situation.

Exponential backoff is a well-known retry strategy in control theory and computer science which aims to make such retries more efficient for the system as a whole by doubling the time interval between consecutive retries.

The retry delay is usually capped at some maximum. For example, once the retry delay hits 1 minute, the next rety might again be 1 minute instead of doubling to 2 minutes. Formally, this is called truncated exponential backoff.

For example, if a network call fails, it is retried 1 second later. If it then fails again, the system waits 2 seconds before retrying, and so on.

sequenceDiagram
  HR assistant->>HR system: Export employee records
  HR system->>Database: Query records
  Database-->>HR system: Error: overloaded

  Note over HR system, Database: Sleep 1 second (+jitter)
  HR system-->>Database: Query records
  Database-->>HR system: Error: connection refused

  Note over HR system, Database: Sleep 2 seconds (+jitter)
  HR system-->>Database: Query records
  Database-->>HR system: Error: system starting

  Note over HR system, Database: Sleep 4 seconds (+jitter)
  HR system-->>Database: Query records
  Database->>HR system: Return results

  HR system->>HR assistant: Download export

Jitter

Robust, mature implementations also introduce a slight, randomized delay (called jitter) alongside each retry. This aims to spread out different clients so that they are not all retrying requests at the exact same moment, but gradually become equally distributed over time.

An good explanation of why this is important is available on the AWS Architecture blog in Exponential Backoff And Jitter.

Further reading