Retries and Timeouts

    Timeouts work hand in hand with retries. Once a request can be retried several times, it becomes important to limit the total amount of time a client waits before giving up entirely. Imagine, for example, a series of retries that forces a client to wait 10 seconds for a response.
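
    The interaction between retries and an overall time limit can be sketched in a few lines of Python. This is an illustration only, not how the Linkerd proxy is implemented; the names call_with_deadline, GaveUp, and the operation callback are hypothetical and exist purely for this example.

```python
import time


class GaveUp(Exception):
    """Raised when attempts or the overall time budget run out."""


def call_with_deadline(operation, max_attempts, total_timeout, clock=time.monotonic):
    """Retry operation until it succeeds, attempts run out, or the deadline passes.

    operation receives the number of seconds left in the overall budget and is
    expected to honor it as its own per-attempt timeout (an assumption of this
    sketch; the wrapper cannot interrupt a call that is already running).
    """
    deadline = clock() + total_timeout
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # the overall time budget is spent, so stop retrying
        try:
            return operation(remaining)
        except Exception as exc:  # any failure counts as retryable in this sketch
            last_error = exc
    raise GaveUp("no successful response within the retry and time limits") from last_error
```

    With a 10 second budget and attempts that each fail after 4 seconds, the third attempt is the last one started, no matter how many attempts are nominally allowed.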

    A service profile may define certain routes as retryable or specify timeouts for routes. This will cause the Linkerd proxy to perform the appropriate retries or timeouts when calling that service. Retries and timeouts are always performed on the outbound (client) side.

    Traditionally, when performing retries, you must specify a maximum number of retry attempts before giving up. Unfortunately, there are two major problems with configuring retries this way.

    First, you need to pick a number that’s high enough to make a difference; allowing more than one retry attempt is usually prudent and, if your service is less reliable, you’ll probably want to allow several retry attempts. On the other hand, allowing too many retry attempts can generate a lot of extra requests and extra load on the system. Performing a lot of retries can also seriously increase the latency of requests that need to be retried. In practice, you usually pick a maximum number of retry attempts out of a hat (3?) and then tweak it through trial and error until the system behaves roughly how you want it to.

    Second, systems configured this way are vulnerable to retry storms: when a service starts failing more often than usual, its clients retry the failed requests, and the extra load from those retries causes further failures and even more retries.
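
    A quick worst-case calculation shows how a fixed per-request retry limit amplifies load. The function below is a back-of-the-envelope sketch; worst_case_requests and its parameters are hypothetical names, and it assumes every attempt fails and that every layer of callers uses the same limit.

```python
def worst_case_requests(original_requests, max_retries, layers=1):
    """Upper bound on requests reaching the deepest service when every attempt fails.

    Each layer of callers sends one original attempt plus up to max_retries
    retries, so every layer multiplies the load by (1 + max_retries).
    """
    per_layer_factor = 1 + max_retries
    return original_requests * per_layer_factor ** layers


# One layer of clients, each allowed 3 retries: load can quadruple.
print(worst_case_requests(100, max_retries=3))            # 400
# Three layers of callers that all retry: the factors multiply.
print(worst_case_requests(100, max_retries=3, layers=3))  # 6400
```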

    Retry Budgets to the Rescue

    To avoid the problems of retry storms and arbitrary numbers of retry attempts, Linkerd configures retries using retry budgets. Rather than specifying a fixed maximum number of retry attempts per request, Linkerd keeps track of the ratio between regular requests and retries and keeps that ratio below a configurable limit. For example, you may specify that you want retries to add at most 20% more requests. Linkerd will then retry as much as it can while maintaining that ratio.
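
    The idea of a ratio-based budget can be sketched as a simple counter. This is a deliberately simplified model and not Linkerd's actual implementation: it counts over the lifetime of the object rather than over a time window, and the class RetryBudget and its methods are hypothetical names chosen for this example.

```python
class RetryBudget:
    """Allow retries only while they stay within a fixed ratio of regular requests."""

    def __init__(self, retry_ratio):
        self.retry_ratio = retry_ratio  # e.g. 0.2 means at most 20% extra requests
        self.requests = 0               # regular (non-retry) requests seen so far
        self.retries = 0                # retries granted so far

    def record_request(self):
        """Count one regular request, which enlarges the budget."""
        self.requests += 1

    def try_retry(self):
        """Return True and spend budget if one more retry keeps the ratio in bounds."""
        if self.retries + 1 <= self.retry_ratio * self.requests:
            self.retries += 1
            return True
        return False


budget = RetryBudget(retry_ratio=0.2)
for _ in range(100):
    budget.record_request()

# Even if all 100 requests fail, only 20 retries are granted.
granted = sum(budget.try_retry() for _ in range(100))
print(granted)  # 20
```

    Unlike a per-request limit, the number of retries here scales with regular traffic, so a spike in failures cannot add more than the configured share of extra load.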

    Configuring retries is always a trade-off between improving success rate and not adding too much extra load to the system. Retry budgets make that trade-off explicit by letting you specify exactly how much extra load your system is willing to accept from retries.