Outlier detection

    Depending on the type of outlier detection, ejection either runs inline (for example in the case of consecutive 5xx) or at a specified interval (for example in the case of periodic success rate). The ejection algorithm works as follows:

    1. A host is determined to be an outlier.
    2. If no hosts have been ejected, Envoy will eject the host immediately. Otherwise, it checks to make sure the number of ejected hosts is below the allowed threshold (specified via the setting). If the number of ejected hosts is above the threshold, the host is not ejected.
    3. An ejected host will automatically be brought back into service after the ejection time has been satisfied.

    Generally, outlier detection is used alongside active health checking for a comprehensive health checking solution.
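    The ejection steps above can be sketched roughly as follows. This is a minimal illustration, not Envoy's implementation: the class and the names max_ejection_percent and ejection_time_s are hypothetical, and the sketch assumes the allowed threshold is expressed as a percentage of the cluster's hosts.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of a cluster's ejection state (not Envoy's code)."""
    hosts: list
    max_ejection_percent: float  # assumed name for the allowed threshold
    ejection_time_s: float       # assumed name for the ejection time
    ejected_at: dict = field(default_factory=dict)  # host -> ejection timestamp

    def on_outlier(self, host, now):
        """Steps 1 and 2: decide whether an outlier host is actually ejected."""
        if host in self.ejected_at:
            return False  # already ejected
        if self.ejected_at:
            # Other hosts are already ejected: the ejected share must be
            # below the threshold for this ejection to go ahead.
            ejected_percent = 100.0 * len(self.ejected_at) / len(self.hosts)
            if ejected_percent >= self.max_ejection_percent:
                return False
        # No hosts ejected yet, or still below the threshold: eject.
        self.ejected_at[host] = now
        return True

    def tick(self, now):
        """Step 3: bring hosts back once the ejection time has passed."""
        for host, since in list(self.ejected_at.items()):
            if now - since >= self.ejection_time_s:
                del self.ejected_at[host]
```

    In this sketch a caller would invoke on_outlier whenever a detector flags a host and call tick periodically to return hosts to service.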

    Envoy supports the following outlier detection types:

    If an upstream host returns some number of consecutive 5xx, it will be ejected. Note that in this case a 5xx means an actual 5xx response code, or an event that would cause the HTTP router to return one on the upstream’s behalf (reset, connection failure, etc.). The number of consecutive 5xx required for ejection is controlled by the value.

    If an upstream host returns some number of consecutive “gateway errors” (502, 503 or 504 status code), it will be ejected. Note that this includes events that would cause the HTTP router to return one of these status codes on the upstream’s behalf (reset, connection failure, etc.). The number of consecutive gateway failures required for ejection is controlled by the outlier_detection.consecutive_gateway_failure value.
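    The two consecutive-error detectors described above can be sketched as a pair of per-host counters. This is an illustration under stated assumptions, not Envoy's code: the class, its parameter names, and the returned labels are hypothetical, and the caller is assumed to pass in the status the HTTP router would return for events such as resets or connection failures.

```python
GATEWAY_CODES = {502, 503, 504}  # "gateway errors" as listed in the text


class ConsecutiveErrorTracker:
    """Hypothetical per-host tracker for consecutive 5xx and gateway errors."""

    def __init__(self, consecutive_5xx, consecutive_gateway_failure):
        self.limit_5xx = consecutive_5xx
        self.limit_gateway = consecutive_gateway_failure
        self.run_5xx = 0      # current run of consecutive 5xx responses
        self.run_gateway = 0  # current run of consecutive 502/503/504

    def record(self, status):
        """Record one response status; return an ejection label or None."""
        # Any response outside the class breaks that class's run.
        self.run_5xx = self.run_5xx + 1 if 500 <= status <= 599 else 0
        self.run_gateway = self.run_gateway + 1 if status in GATEWAY_CODES else 0

        label = None
        if self.run_gateway >= self.limit_gateway:
            label = "gateway_failure"  # hypothetical label
        elif self.run_5xx >= self.limit_5xx:
            label = "5xx"              # hypothetical label
        if label is not None:
            # Start counting afresh once the host has been flagged.
            self.run_5xx = 0
            self.run_gateway = 0
        return label
```

    Because these checks run on every response, this kind of detection can happen inline rather than on a timer.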

    Success Rate based outlier ejection aggregates success rate data from every host in a cluster. It then ejects hosts at given intervals based on statistical outlier detection. Success Rate outlier ejection will not be calculated for a host if its request volume over the aggregation interval is less than the value. Moreover, detection will not be performed for a cluster if the number of hosts with the minimum required request volume in an interval is less than the outlier_detection.success_rate_minimum_hosts value.
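    A rough sketch of one interval of success rate detection follows. The text above does not spell out the statistical rule, so this example assumes a "mean minus a multiple of the standard deviation" cutoff; that rule, the function name, and the parameters request_volume, minimum_hosts, and stdev_factor are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def success_rate_outliers(stats, request_volume, minimum_hosts, stdev_factor):
    """Return hosts whose success rate is a statistical outlier.

    stats maps host -> (successful_requests, total_requests) for one interval.
    """
    # Hosts below the minimum request volume are left out of the calculation.
    rates = {
        host: 100.0 * ok / total
        for host, (ok, total) in stats.items()
        if total > 0 and total >= request_volume
    }
    # Too few qualifying hosts: skip detection for the whole cluster.
    if len(rates) < minimum_hosts:
        return []
    values = list(rates.values())
    # Assumed rule: eject hosts below mean - factor * standard deviation.
    threshold = mean(values) - stdev_factor * pstdev(values)
    return sorted(host for host, rate in rates.items() if rate < threshold)
```

    Rates are kept on a 0-100 scale to match the values reported in the ejection event log.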

    A log of outlier ejection events can optionally be produced by Envoy. This is extremely useful during daily operations since global stats do not provide enough information on which hosts are being ejected and for what reasons. The log uses a JSON format with one object per line:

    time

    The time that the event took place.

    The time in seconds since the last action (either an ejection or unejection) took place. This value will be for the first ejection given there is no action before the first ejection.

    cluster

    The cluster that owns the ejected host.

    upstream_url

    The URL of the ejected host. E.g., tcp://1.2.3.4:80.

    action

    The action that took place. Either eject if a host was ejected or uneject if it was brought back into service.

    type

    num_ejections

    If action is eject, specifies the number of times the host has been ejected (the count is local to that Envoy and is reset if the host is removed from the upstream cluster for any reason and then re-added).

    enforced

    If action is eject, specifies if the ejection was enforced. true means the host was ejected. false means the event was logged but the host was not actually ejected.

    host_success_rate

    If action is eject, and type is SuccessRate, specifies the host’s success rate at the time of the ejection event on a 0-100 range.

    cluster_success_rate_average

    If action is eject, and type is SuccessRate, specifies the average success rate of the hosts in the cluster at the time of the ejection event on a 0-100 range.

    If action is eject, and type is SuccessRate, specifies the success rate ejection threshold at the time of the ejection event.
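    Since the log holds one JSON object per line, it can be processed with any line-oriented JSON reader. The sketch below counts enforced ejections per host using only field names listed above; the function name and the sample lines are hypothetical and are not real Envoy output.

```python
import json


def summarize_ejections(lines):
    """Count enforced ejections per (cluster, upstream_url) from JSON lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        # Only count events where the host was actually ejected.
        if event.get("action") == "eject" and event.get("enforced"):
            key = (event["cluster"], event["upstream_url"])
            counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample lines built from the documented field names.
sample = [
    '{"cluster": "backend", "upstream_url": "tcp://1.2.3.4:80", '
    '"action": "eject", "type": "SuccessRate", "enforced": true}',
    '{"cluster": "backend", "upstream_url": "tcp://1.2.3.4:80", '
    '"action": "uneject"}',
]
print(summarize_ejections(sample))
```

    Filtering on enforced separates real ejections from events that were only logged.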