The simplest way to see the available metrics is to cURL the metrics endpoint /metrics. The format is described in the Prometheus exposition format documentation.
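As a rough illustration of reading that endpoint programmatically, the sketch below fetches the metrics text and parses it into a dictionary. The URL (a local member on port 2379), the helper names fetch_metrics and parse_metrics, and the sample values are assumptions made for this example; the parser is deliberately simplified and assumes samples carry no trailing timestamp.

```python
import urllib.request

# Assumed URL: a local member serving metrics on its client port.
DEFAULT_URL = "http://localhost:2379/metrics"


def fetch_metrics(url=DEFAULT_URL):
    """Return the raw metrics text served at `url`."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Map each series (name plus any labels) to its float value.

    Simplification: assumes samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Made-up sample in the exposition format, for illustration only.
SAMPLE = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0
"""

print(parse_metrics(SAMPLE))
```

Against a running member, parse_metrics(fetch_metrics()) would produce the same kind of dictionary from live data.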
Follow the Prometheus getting started doc to spin up a Prometheus server to collect etcd metrics.
The naming of metrics follows the suggested Prometheus best practices. A metric name has an etcd or etcd_debugging prefix as its namespace and a subsystem prefix (for example wal and etcdserver).
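The naming scheme can be made concrete with a small sketch that splits a full metric name into namespace, subsystem, and metric name. The function split_metric_name is hypothetical, and it assumes the subsystem is a single word with no underscore, which may not hold for every metric. The etcd_debugging name used in the usage lines is invented for illustration.

```python
# Longest namespace first, so "etcd_debugging_" is not mistaken for "etcd_".
NAMESPACES = ("etcd_debugging", "etcd")


def split_metric_name(full_name):
    """Return (namespace, subsystem, name), or None for non-etcd metrics.

    Assumption: the subsystem contains no underscore.
    """
    for namespace in NAMESPACES:
        prefix = namespace + "_"
        if full_name.startswith(prefix):
            rest = full_name[len(prefix):]
            subsystem, _, name = rest.partition("_")
            return namespace, subsystem, name
    return None


print(split_metric_name("etcd_server_has_leader"))
print(split_metric_name("etcd_debugging_example_metric"))  # hypothetical name
```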
The metrics under the etcd prefix are for monitoring and alerting. They are stable, high-level metrics. Any change to these metrics will be included in the release notes.
Metrics related to etcd2 are documented in the v2 metrics guide.
These metrics describe the status of the etcd server. In order to detect outages or problems for troubleshooting, the server metrics of every production etcd cluster should be closely monitored.
All these metrics are prefixed with etcd_server_.
has_leader indicates whether the member has a leader. If a member does not have a leader, it is totally unavailable. If none of the members in the cluster has a leader, the entire cluster is totally unavailable.
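A minimal sketch of the two availability checks described above, assuming the has_leader value of each member has already been collected into a dictionary keyed by member name. The function names and the sample member names are hypothetical.

```python
def members_without_leader(has_leader_by_member):
    """Return the members whose has_leader value is not 1."""
    return sorted(m for m, v in has_leader_by_member.items() if v != 1)


def cluster_unavailable(has_leader_by_member):
    """True when at least one member was observed and none has a leader."""
    return bool(has_leader_by_member) and all(
        v != 1 for v in has_leader_by_member.values()
    )


# Made-up observations for a three-member cluster.
observed = {"member-a": 1, "member-b": 0, "member-c": 1}
print(members_without_leader(observed))
print(cluster_unavailable(observed))
```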
proposals_committed_total records the total number of consensus proposals committed. This gauge should increase over time if the cluster is healthy. Several healthy members of an etcd cluster may have different total committed proposals at once. This discrepancy may be due to recovering from peers after starting, lagging behind the leader, or being the leader and therefore having the most commits. It is important to monitor this metric across all the members in the cluster; a consistently large lag between a single member and its leader indicates that member is slow or unhealthy.
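The cross-member comparison can be sketched as follows. This example treats the highest committed count as a stand-in for the leader's value, which follows from the note that the leader usually has the most commits but is still an approximation. The function names, the threshold, and the sample numbers are hypothetical, and a single snapshot cannot show that a lag is consistent, so a real check would repeat this over several scrapes.

```python
def commit_lag(committed_by_member):
    """Return how far each member is behind the highest committed count."""
    highest = max(committed_by_member.values())
    return {m: highest - v for m, v in committed_by_member.items()}


def lagging_members(committed_by_member, threshold):
    """Return members whose lag exceeds `threshold` (a hypothetical limit)."""
    lags = commit_lag(committed_by_member)
    return sorted(m for m, lag in lags.items() if lag > threshold)


# Made-up proposals_committed_total values from one scrape of each member.
snapshot = {"member-a": 10500, "member-b": 10480, "member-c": 7200}
print(commit_lag(snapshot))
print(lagging_members(snapshot, threshold=1000))
```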
proposals_applied_total records the total number of consensus proposals applied. The etcd server applies every committed proposal asynchronously. The difference between proposals_committed_total and proposals_applied_total should usually be small (within a few thousand even under high load). If the difference between them continues to rise, it indicates that the etcd server is overloaded. This might happen when applying expensive queries like heavy range queries or large txn operations.
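One way to picture the overload signal is to track the gap between committed and applied proposals over successive scrapes of one member. The sketch below is an illustration only: the function names are hypothetical, the sample pairs are made up, and "continues to rise" is interpreted here as a strictly increasing gap across every consecutive pair of samples.

```python
def apply_backlog(committed, applied):
    """Committed proposals that have not been applied yet."""
    return committed - applied


def backlog_keeps_rising(samples):
    """True if the backlog grows across every consecutive pair of samples.

    `samples` is a list of (committed, applied) pairs ordered by scrape time.
    """
    backlogs = [apply_backlog(c, a) for c, a in samples]
    return len(backlogs) >= 2 and all(
        later > earlier for earlier, later in zip(backlogs, backlogs[1:])
    )


# Made-up (committed, applied) pairs from four consecutive scrapes.
history = [(1000, 990), (2000, 1900), (3000, 2500), (4000, 2800)]
print([apply_backlog(c, a) for c, a in history])
print(backlog_keeps_rising(history))
```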
proposals_pending indicates how many proposals are queued to commit. Rising pending proposals suggest that there is a high client load or that the member cannot commit proposals.
Failed proposals (proposals_failed_total) are normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.
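Because a running total says little on its own, failed proposals are usually easier to read as a rate. The sketch below computes a per-second rate from two scrapes of a total such as proposals_failed_total. The function name is hypothetical, and treating a decrease as a restart from zero is an assumption of this example rather than documented etcd behavior.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second increase of a running total between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_value < prev_value:
        # Assumed reset (for example a restart): count from zero.
        delta = curr_value
    else:
        delta = curr_value - prev_value
    return delta / interval_seconds


# Made-up values: 30 new failures over a 15 second scrape interval.
print(counter_rate(10, 40, 15))
```

A short burst in this rate would fit a leader election, while a rate that stays elevated would fit the longer downtime caused by lost quorum.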
These metrics describe the status of the disk operations.
All these metrics are prefixed with etcd_disk_.
A wal_fsync is called when etcd persists its log entries to disk before applying them.
A backend_commit is called when etcd commits an incremental snapshot of its most recent changes to disk.
These metrics describe the status of the network.
All these metrics are prefixed with etcd_network_.
peer_sent_bytes_total counts the total number of bytes sent to a specific peer. Usually the leader member sends more data than other members since it is responsible for transmitting replicated data.
peer_received_bytes_total counts the total number of bytes received from a specific peer. Usually follower members receive data only from the leader member.
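To illustrate the expected traffic pattern, the sketch below sums the bytes each member has sent to its peers and picks the member with the largest total, which would usually be the leader. The nested dictionary layout, the function names, and the byte counts are assumptions for this example, and the result is only a heuristic.

```python
def total_sent_by_member(sent_bytes):
    """Sum peer_sent_bytes_total over all peers for each member.

    `sent_bytes` maps member -> {peer: bytes sent to that peer}.
    """
    return {member: sum(peers.values()) for member, peers in sent_bytes.items()}


def busiest_sender(sent_bytes):
    """Return the member that has sent the most bytes in total."""
    totals = total_sent_by_member(sent_bytes)
    return max(totals, key=totals.get)


# Made-up byte counts for a three-member cluster.
observed = {
    "member-a": {"member-b": 900000, "member-c": 880000},
    "member-b": {"member-a": 12000, "member-c": 300},
    "member-c": {"member-a": 11000, "member-b": 250},
}
print(total_sent_by_member(observed))
print(busiest_sender(observed))
```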
These metrics are exposed via .
The metrics under the etcd_debugging prefix are for debugging. They are very implementation-dependent and volatile. They might be changed or removed without any warning in new etcd releases. Some of the metrics might be moved to the etcd prefix when they become more stable.
An abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable.
The Prometheus client library provides a number of metrics under the go and process namespaces. There are a few that are particularly interesting.