Statistics

    The cluster manager has a statistics tree rooted at cluster_manager. with the following statistics. Any : character in the stats name is replaced with _. Stats include all clusters managed by the cluster manager, including both clusters used for data plane upstreams and control plane xDS clusters.
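
As a small illustration of the renaming rule, the following Python sketch assumes that the replaced character is `:` and the replacement is `_`. The helper name `sanitize_stat_name` and the example cluster name are hypothetical and are not part of Envoy; the function only mimics the documented behavior.

```python
def sanitize_stat_name(name: str) -> str:
    """Return the name with every ':' replaced by '_' (hypothetical helper)."""
    return name.replace(":", "_")


# A cluster named "backend:8080" would appear under this sanitized name.
print(sanitize_stat_name("backend:8080"))
```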

    Every cluster has a statistics tree rooted at cluster.<name>. with the following statistics:

    Name

    Type

    Description

    upstream_cx_total

    Counter

    Total connections

    upstream_cx_active

    Gauge

    Total active connections

    upstream_cx_http1_total

    Counter

    Total HTTP/1.1 connections

    upstream_cx_http2_total

    Counter

    Total HTTP/2 connections

    upstream_cx_http3_total

    Counter

    Total HTTP/3 connections

    upstream_cx_connect_fail

    Counter

    Total connection failures

    upstream_cx_connect_timeout

    Counter

    Total connection connect timeouts

    upstream_cx_connect_with_0_rtt

    Counter

    Total connections able to send 0-RTT requests (early data)

    upstream_cx_idle_timeout

    Counter

    Total connection idle timeouts

    upstream_cx_max_duration_reached

    Counter

    Total connections closed due to max duration reached

    upstream_cx_connect_attempts_exceeded

    Counter

    Total consecutive connection failures exceeding configured connection attempts

    upstream_cx_overflow

    Counter

    Total times that the cluster’s connection circuit breaker overflowed

    upstream_cx_connect_ms

    Histogram

    Connection establishment milliseconds

    upstream_cx_length_ms

    Histogram

    Connection length milliseconds

    upstream_cx_destroy

    Counter

    Total destroyed connections

    upstream_cx_destroy_local

    Counter

    Total connections destroyed locally

    upstream_cx_destroy_remote

    Counter

    Total connections destroyed remotely

    upstream_cx_destroy_with_active_rq

    Counter

    Total connections destroyed with 1+ active request

    upstream_cx_destroy_local_with_active_rq

    Counter

    Total connections destroyed locally with 1+ active request

    upstream_cx_destroy_remote_with_active_rq

    Counter

    Total connections destroyed remotely with 1+ active request

    upstream_cx_close_notify

    Counter

    Total connections closed via HTTP/1.1 connection close header or HTTP/2 or HTTP/3 GOAWAY

    upstream_cx_rx_bytes_total

    Counter

    Total received connection bytes

    upstream_cx_rx_bytes_buffered

    Gauge

    Received connection bytes currently buffered

    upstream_cx_tx_bytes_total

    Counter

    Total sent connection bytes

    upstream_cx_tx_bytes_buffered

    Gauge

    Send connection bytes currently buffered

    upstream_cx_pool_overflow

    Counter

    Total times that the cluster’s connection pool circuit breaker overflowed

    upstream_cx_protocol_error

    Counter

    Total connection protocol errors

    upstream_cx_max_requests

    Counter

    Total connections closed due to maximum requests

    upstream_cx_none_healthy

    Counter

    Total times connection not established due to no healthy hosts

    upstream_rq_total

    Counter

    Total requests

    upstream_rq_active

    Gauge

    Total active requests

    upstream_rq_pending_total

    Counter

    Total requests pending a connection pool connection

    upstream_rq_pending_overflow

    Counter

    Total requests that overflowed connection pool or requests (mainly for HTTP/2 and above) circuit breaking and were failed

    upstream_rq_pending_failure_eject

    Counter

    Total requests that were failed due to a connection pool connection failure or remote connection termination

    upstream_rq_pending_active

    Gauge

    Total active requests pending a connection pool connection

    upstream_rq_cancelled

    Counter

    Total requests cancelled before obtaining a connection pool connection

    upstream_rq_maintenance_mode

    Counter

    Total requests that resulted in an immediate 503 due to maintenance mode

    upstream_rq_timeout

    Counter

    Total requests that timed out waiting for a response

    upstream_rq_max_duration_reached

    Counter

    Total requests closed due to max duration reached

    upstream_rq_per_try_timeout

    Counter

    Total requests that hit the per try timeout (except when request hedging is enabled)

    upstream_rq_rx_reset

    Counter

    Total requests that were reset remotely

    upstream_rq_tx_reset

    Counter

    Total requests that were reset locally

    upstream_rq_retry

    Counter

    Total request retries

    upstream_rq_retry_backoff_exponential

    Counter

    Total retries using the exponential backoff strategy

    upstream_rq_retry_backoff_ratelimited

    Counter

    Total retries using the ratelimited backoff strategy

    upstream_rq_retry_limit_exceeded

    Counter

    Total requests not retried due to exceeding the configured number of maximum retries

    upstream_rq_retry_success

    Counter

    Total request retry successes

    upstream_rq_retry_overflow

    Counter

    Total requests not retried due to circuit breaking or exceeding the retry budget

    upstream_flow_control_paused_reading_total

    Counter

    Total number of times flow control paused reading from upstream

    upstream_flow_control_resumed_reading_total

    Counter

    Total number of times flow control resumed reading from upstream

    upstream_flow_control_backed_up_total

    Counter

    Total number of times the upstream connection backed up and paused reads from downstream

    upstream_flow_control_drained_total

    Counter

    Total number of times the upstream connection drained and resumed reads from downstream

    upstream_internal_redirect_failed_total

    Counter

    Total number of times failed internal redirects resulted in redirects being passed downstream.

    upstream_internal_redirect_succeeded_total

    Counter

    Total number of times internal redirects resulted in a second upstream request.

    membership_change

    Counter

    Total cluster membership changes

    membership_healthy

    Gauge

    Current cluster healthy total (inclusive of both health checking and outlier detection)

    membership_degraded

    Gauge

    Current cluster degraded total

    membership_excluded

    Gauge

    Current cluster excluded total

    membership_total

    Gauge

    Current cluster membership total

    retry_or_shadow_abandoned

    Counter

    Total number of times shadowing or retry buffering was canceled due to buffer limits

    config_reload

    Counter

    Total API fetches that resulted in a config reload due to a different config

    update_attempt

    Counter

    Total attempted cluster membership updates by service discovery

    update_success

    Counter

    Total successful cluster membership updates by service discovery

    update_failure

    Counter

    Total failed cluster membership updates by service discovery

    update_duration

    Histogram

    Amount of time spent updating configs

    update_empty

    Counter

    Total cluster membership updates ending with empty cluster load assignment and continuing with previous config

    update_no_rebuild

    Counter

    Total successful cluster membership updates that didn’t result in any cluster load balancing structure rebuilds

    version

    Gauge

    Hash of the contents from the last successful API fetch

    max_host_weight

    Gauge

    Maximum weight of any host in the cluster

    bind_errors

    Counter

    Total errors binding the socket to the configured source address

    assignment_timeout_received

    Counter

    Total assignments received with endpoint lease information.

    assignment_stale

    Counter

    Number of times the received assignments went stale before new assignments arrived.
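
The counters and gauges above are often consumed from a plain-text dump with one `name: value` pair per line. The following Python sketch assumes that format and assumes cluster names contain no `.` characters. The function names, the sample text, and the cluster name `backend` are hypothetical. The failure ratio also assumes that failed connection attempts are counted in upstream_cx_total.

```python
from collections import defaultdict

# Hypothetical sample dump; real output will contain many more lines.
SAMPLE = """\
cluster.backend.upstream_cx_total: 40
cluster.backend.upstream_cx_connect_fail: 4
cluster.backend.upstream_cx_active: 3
cluster.backend.upstream_cx_connect_ms: P0(nan,0) P50(nan,2.05)
cluster_manager.active_clusters: 1
"""


def parse_cluster_stats(stats_text: str) -> dict:
    """Group integer-valued 'cluster.<name>.<stat>: <value>' lines by cluster.

    Lines outside the cluster.<name>. tree and lines whose value is not an
    integer (for example histogram summaries) are skipped.
    """
    clusters = defaultdict(dict)
    for line in stats_text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep or not name.startswith("cluster."):
            continue
        parts = name.split(".", 2)  # ["cluster", "<name>", "<stat>"]
        if len(parts) != 3:
            continue
        try:
            parsed = int(value)
        except ValueError:
            continue
        clusters[parts[1]][parts[2]] = parsed
    return dict(clusters)


def connect_failure_ratio(stats: dict) -> float:
    """Return upstream_cx_connect_fail / upstream_cx_total, or 0.0 if no connections."""
    total = stats.get("upstream_cx_total", 0)
    if total == 0:
        return 0.0
    return stats.get("upstream_cx_connect_fail", 0) / total


parsed = parse_cluster_stats(SAMPLE)
print(parsed["backend"])
print(connect_failure_ratio(parsed["backend"]))
```

Because counters only increase, a monitoring system would normally compute such ratios over the difference between two samples rather than over lifetime totals.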

    HTTP/3 protocol stats are global with the following statistics:

    Name

    Type

    Description

    upstream.<tx/rx>.quic_connection_close_error_code_<error_code>

    Counter

    A collection of counters that are lazily initialized to record each QUIC connection close’s error code.

    upstream.<tx/rx>.quic_reset_stream_error_code_<error_code>

    Counter

    A collection of counters that are lazily initialized to record each QUIC stream reset error code.

    If health check is configured, the cluster has an additional statistics tree rooted at cluster.<name>.health_check. with the following statistics:

    Name

    Type

    Description

    attempt

    Counter

    Number of health checks

    success

    Counter

    Number of successful health checks

    failure

    Counter

    Number of immediately failed health checks (e.g. HTTP 503) as well as network failures

    passive_failure

    Counter

    Number of health check failures due to passive events (e.g. x-envoy-immediate-health-check-fail)

    network_failure

    Counter

    Number of health check failures due to network error

    verify_cluster

    Counter

    Number of health checks that attempted cluster name verification

    healthy

    Gauge

    Number of healthy members
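
To show how these counters relate, here is a minimal Python sketch that derives a success rate and the share of failures caused by network errors. It relies on the table's statement that failure includes network failures. The function name `health_check_summary` and the input dictionary layout are hypothetical.

```python
def health_check_summary(counters: dict) -> dict:
    """Derive simple ratios from health_check counters (hypothetical helper).

    Ratios are None when their denominator is zero.
    """
    attempt = counters.get("attempt", 0)
    success = counters.get("success", 0)
    failure = counters.get("failure", 0)
    network_failure = counters.get("network_failure", 0)
    return {
        "success_rate": success / attempt if attempt else None,
        "network_failure_share": network_failure / failure if failure else None,
    }


summary = health_check_summary(
    {"attempt": 200, "success": 190, "failure": 10, "network_failure": 4}
)
print(summary)
```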

    If outlier detection is configured for a cluster, statistics will be rooted at cluster.<name>.outlier_detection. and contain the following:

    Name

    Type

    Description

    ejections_enforced_total

    Counter

    Number of enforced ejections due to any outlier type

    ejections_active

    Gauge

    Number of currently ejected hosts

    ejections_overflow

    Counter

    Number of ejections aborted due to the max ejection %

    ejections_enforced_consecutive_5xx

    Counter

    Number of enforced consecutive 5xx ejections

    ejections_detected_consecutive_5xx

    Counter

    Number of detected consecutive 5xx ejections (even if unenforced)

    ejections_enforced_success_rate

    Counter

    Number of enforced success rate outlier ejections. The exact meaning of this counter depends on the outlier detection configuration; refer to the Outlier Detection documentation for details.

    ejections_detected_success_rate

    Counter

    Number of detected success rate outlier ejections (even if unenforced). The exact meaning of this counter depends on the outlier detection configuration; refer to the Outlier Detection documentation for details.

    ejections_enforced_consecutive_gateway_failure

    Counter

    Number of enforced consecutive gateway failure ejections

    ejections_detected_consecutive_gateway_failure

    Counter

    Number of detected consecutive gateway failure ejections (even if unenforced)

    ejections_enforced_consecutive_local_origin_failure

    Counter

    Number of enforced consecutive local origin failure ejections

    ejections_detected_consecutive_local_origin_failure

    Counter

    Number of detected consecutive local origin failure ejections (even if unenforced)

    ejections_enforced_local_origin_success_rate

    Counter

    Number of enforced success rate outlier ejections for locally originated failures

    ejections_detected_local_origin_success_rate

    Counter

    Number of detected success rate outlier ejections for locally originated failures (even if unenforced)

    ejections_enforced_failure_percentage

    Counter

    Number of enforced failure percentage outlier ejections. The exact meaning of this counter depends on the outlier detection configuration; refer to the Outlier Detection documentation for details.

    ejections_detected_failure_percentage

    Counter

    Number of detected failure percentage outlier ejections (even if unenforced). The exact meaning of this counter depends on the outlier detection configuration; refer to the Outlier Detection documentation for details.

    ejections_enforced_failure_percentage_local_origin

    Counter

    Number of enforced failure percentage outlier ejections for locally originated failures

    ejections_detected_failure_percentage_local_origin

    Counter

    Number of detected failure percentage outlier ejections for locally originated failures (even if unenforced)

    ejections_total

    Counter

    Deprecated. Number of ejections due to any outlier type (even if unenforced)

    ejections_consecutive_5xx

    Counter

    Deprecated. Number of consecutive 5xx ejections (even if unenforced)
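
Since each detected counter includes ejections that were not enforced, subtracting the matching enforced counter gives a rough count of detections that did not lead to an ejection. The Python sketch below pairs the counters by their name suffix. The helper name `unenforced_ejections` and the input dictionary are hypothetical.

```python
def unenforced_ejections(counters: dict) -> dict:
    """Return detected minus enforced for every ejections_detected_<type> counter."""
    detected_prefix = "ejections_detected_"
    enforced_prefix = "ejections_enforced_"
    result = {}
    for name, detected in counters.items():
        if not name.startswith(detected_prefix):
            continue
        kind = name[len(detected_prefix):]
        enforced = counters.get(enforced_prefix + kind, 0)
        result[kind] = detected - enforced
    return result


print(unenforced_ejections({
    "ejections_detected_consecutive_5xx": 7,
    "ejections_enforced_consecutive_5xx": 5,
    "ejections_active": 1,
}))
```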

    Circuit breakers statistics

    Circuit breakers statistics will be rooted at cluster.<name>.circuit_breakers.<priority>. and contain the following:

    Note

    Metrics starting with the remaining_ prefix are not generated by default. To track the number of resources remaining until a circuit breaker opens, set the track_remaining parameter to true in the circuit breaker configuration.

    If timeout budget statistics tracking is turned on, statistics will be added to cluster.<name> and contain the following:

    Name

    Type

    Description

    upstream_rq_timeout_budget_percent_used

    Histogram

    What percentage of the global timeout was used waiting for a response

    upstream_rq_timeout_budget_per_try_percent_used

    Histogram

    What percentage of the per try timeout was used waiting for a response

    If HTTP is used, dynamic HTTP response code statistics are also available. These are emitted by various internal systems as well as some filters such as the router filter. They are rooted at cluster.<name>. and contain the following statistics:

    Name

    Type

    Description

    upstream_rq_completed

    Counter

    Total upstream requests completed

    upstream_rq_<*xx>

    Counter

    Aggregate HTTP response codes (e.g., 2xx, 3xx, etc.)

    upstream_rq_<*>

    Counter

    Specific HTTP response codes (e.g., 201, 302, etc.)

    upstream_rq_time

    Histogram

    Request time milliseconds

    canary.upstream_rq_completed

    Counter

    Total upstream canary requests completed

    canary.upstream_rq_<*xx>

    Counter

    Upstream canary aggregate HTTP response codes

    canary.upstream_rq_<*>

    Counter

    Upstream canary specific HTTP response codes

    canary.upstream_rq_time

    Histogram

    Upstream canary request time milliseconds

    internal.upstream_rq_completed

    Counter

    Total internal origin requests completed

    internal.upstream_rq_<*xx>

    Counter

    Internal origin aggregate HTTP response codes

    internal.upstream_rq_<*>

    Counter

    Internal origin specific HTTP response codes

    internal.upstream_rq_time

    Histogram

    Internal origin request time milliseconds

    external.upstream_rq_completed

    Counter

    Total external origin requests completed

    external.upstream_rq_<*xx>

    Counter

    External origin aggregate HTTP response codes

    external.upstream_rq_<*>

    Counter

    External origin specific HTTP response codes

    external.upstream_rq_time

    Histogram

    External origin request time milliseconds
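
To make the naming pattern concrete, the following Python sketch lists the counter names that a single response code maps to: the completed counter, the aggregate class counter (such as upstream_rq_5xx), and the specific code counter (such as upstream_rq_503). The function name `response_code_stat_names` and its `prefix` parameter are hypothetical; the prefix stands in for the canary., internal., or external. variants.

```python
def response_code_stat_names(code: int, prefix: str = "") -> list:
    """Return the dynamic counter names for one HTTP response code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP response code: {code}")
    return [
        f"{prefix}upstream_rq_completed",
        f"{prefix}upstream_rq_{code // 100}xx",  # aggregate class, e.g. 5xx
        f"{prefix}upstream_rq_{code}",           # specific code, e.g. 503
    ]


print(response_code_stat_names(503))
print(response_code_stat_names(201, prefix="canary."))
```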

    If TLS is used by the cluster, the following statistics are rooted at cluster.<name>.ssl.:

    Name

    Type

    Description

    connection_error

    Counter

    Total TLS connection errors not including failed certificate verifications

    handshake

    Counter

    Total successful TLS connection handshakes

    session_reused

    Counter

    Total successful TLS session resumptions

    no_certificate

    Counter

    Total successful TLS connections with no client certificate

    fail_verify_no_cert

    Counter

    Total TLS connections that failed because of missing client certificate

    fail_verify_error

    Counter

    Total TLS connections that failed CA verification

    fail_verify_san

    Counter

    Total TLS connections that failed SAN verification

    fail_verify_cert_hash

    Counter

    Total TLS connections that failed certificate pinning verification

    ocsp_staple_failed

    Counter

    Total TLS connections that failed compliance with the OCSP policy

    ocsp_staple_omitted

    Counter

    Total TLS connections that succeeded without stapling an OCSP response

    ocsp_staple_responses

    Counter

    Total TLS connections where a valid OCSP response was available (irrespective of whether the client requested stapling)

    ocsp_staple_requests

    Counter

    Total TLS connections where the client requested an OCSP staple

    ciphers.<cipher>

    Counter

    Total successful TLS connections that used cipher <cipher>

    curves.<curve>

    Counter

    Total successful TLS connections that used ECDHE curve <curve>

    sigalgs.<sigalg>

    Counter

    Total successful TLS connections that used signature algorithm <sigalg>

    versions.<version>

    Counter

    Total successful TLS connections that used protocol version <version>

    The following TCP statistics, which are available when using the TCP stats transport socket, are rooted at cluster.<name>.tcp_stats.:

    Note

    These metrics are provided by the operating system. Due to differences in operating system metrics available and the methodology used to take measurements, the values may not be consistent across different operating systems or versions of the same operating system.

    Name

    Type

    Description

    cx_tx_segments

    Counter

    Total TCP segments transmitted

    cx_rx_segments

    Counter

    Total TCP segments received

    cx_tx_data_segments

    Counter

    Total TCP segments with a non-zero data length transmitted

    cx_rx_data_segments

    Counter

    Total TCP segments with a non-zero data length received

    cx_tx_retransmitted_segments

    Counter

    Total TCP segments retransmitted

    cx_tx_unsent_bytes

    Gauge

    Bytes which Envoy has sent to the operating system which have not yet been sent

    cx_tx_unacked_segments

    Gauge

    Segments which have been transmitted that have not yet been acknowledged

    cx_tx_percent_retransmitted_segments

    Histogram

    Percent of segments on a connection which were retransmitted

    cx_rtt_us

    Histogram

    Smoothed round trip time estimate in microseconds

    cx_rtt_variance_us

    Histogram

    Estimated variance in microseconds of the round trip time. Higher values indicate more variability.

    Alternate tree dynamic HTTP statistics

    If alternate tree statistics are configured, they will be present in the cluster.<name>.<alt name>. namespace. The statistics produced are the same as documented in the dynamic HTTP statistics section.

    If the service zone is available for both the local service and the upstream cluster, Envoy will track the following statistics in the cluster.<name>.zone.<from_zone>.<to_zone>. namespace.

    Load balancer statistics

    Statistics for monitoring load balancer decisions. Stats are rooted at cluster.<name>. and contain the following statistics:

    Name

    Type

    Description

    lb_recalculate_zone_structures

    Counter

    The number of times locality aware routing structures are regenerated for fast decisions on upstream locality selection

    lb_healthy_panic

    Counter

    Total requests load balanced with the load balancer in panic mode

    lb_zone_cluster_too_small

    Counter

    No zone aware routing because of small upstream cluster size

    lb_zone_routing_all_directly

    Counter

    Sending all requests directly to the same zone

    lb_zone_routing_sampled

    Counter

    Sending some requests to the same zone

    lb_zone_routing_cross_zone

    Counter

    Zone aware routing mode but have to send cross zone

    lb_local_cluster_not_ok

    Counter

    Local host set is not set or it is panic mode for local cluster

    lb_zone_number_differs

    Counter

    Number of zones in the local and upstream clusters is different

    lb_zone_no_capacity_left

    Counter

    Total number of times ended with random zone selection due to rounding error

    original_dst_host_invalid

    Counter

    Total number of invalid hosts passed to original destination load balancer

    Load balancer subset statistics

    Statistics for monitoring load balancer subset decisions. Stats are rooted at cluster.<name>. and contain the following statistics:

    Name

    Type

    Description

    lb_subsets_active

    Gauge

    Number of currently available subsets

    lb_subsets_created

    Counter

    Number of subsets created

    lb_subsets_removed

    Counter

    Number of subsets removed due to no hosts

    lb_subsets_selected

    Counter

    Number of times any subset was selected for load balancing

    lb_subsets_fallback

    Counter

    Number of times the fallback policy was invoked

    lb_subsets_fallback_panic

    Counter

    Number of times the subset panic mode triggered

    lb_subsets_single_host_per_subset_duplicate

    Gauge

    Number of duplicate (unused) hosts when using single_host_per_subset

    Ring hash load balancer statistics

    Statistics for monitoring the size and effective distribution of hashes when using the ring hash load balancer. Stats are rooted at cluster.<name>.ring_hash_lb. and contain the following statistics:

    Name

    Type

    Description

    size

    Gauge

    Total number of host hashes on the ring

    min_hashes_per_host

    Gauge

    Minimum number of hashes for a single host

    max_hashes_per_host

    Gauge

    Maximum number of hashes for a single host
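
The three gauges can be illustrated with a toy hash ring. The Python sketch below is not Envoy's algorithm: the hash function (SHA-256), the rounding rule, and the names `build_ring` and `ring_stats` are assumptions made for the example. It places a number of hashes per host proportional to the host's weight, then reports the same three quantities as the table above.

```python
import hashlib
from collections import Counter


def build_ring(host_weights: dict, ring_size: int) -> list:
    """Build a sorted list of (hash, host) pairs, weighted per host.

    Each host gets round(ring_size * weight / total_weight) hashes,
    with a minimum of one (an assumption of this sketch).
    """
    total = sum(host_weights.values())
    ring = []
    for host, weight in host_weights.items():
        count = max(1, round(ring_size * weight / total))
        for i in range(count):
            digest = hashlib.sha256(f"{host}_{i}".encode()).digest()
            ring.append((int.from_bytes(digest[:8], "big"), host))
    ring.sort()
    return ring


def ring_stats(ring: list) -> dict:
    """Compute size and min/max hashes per host for a non-empty ring."""
    per_host = Counter(host for _, host in ring)
    return {
        "size": len(ring),
        "min_hashes_per_host": min(per_host.values()),
        "max_hashes_per_host": max(per_host.values()),
    }


ring = build_ring({"a": 1, "b": 1, "c": 2}, ring_size=1024)
print(ring_stats(ring))
```

A large gap between the minimum and maximum values in such a ring reflects the host weights, so the gauges are most useful for checking that the effective distribution matches the configured one.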

    Statistics for monitoring effective host weights when using the Maglev load balancer. Stats are rooted at cluster.<name>.maglev_lb. and contain the following statistics:

    Name

    Type

    Description

    min_entries_per_host

    Gauge

    Minimum number of entries for a single host

    max_entries_per_host

    Gauge

    Maximum number of entries for a single host

    If request response size statistics are tracked, statistics will be added to cluster.<name> and contain the following: