The simplest way to see the available metrics is to cURL the metrics endpoint. The output uses the Prometheus text exposition format.
Follow the Prometheus getting started guide to spin up a Prometheus server to collect etcd metrics.
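As a rough illustration of what the endpoint returns, the sketch below lists the etcd metric names found in an exposition-format payload. The URL `http://127.0.0.1:2379/metrics` is an assumption about a local member, the helper names are hypothetical, and the sample text is made up so the parser can be tried offline. It handles only simple `name{labels} value` lines, not the full format.

```python
import re
from urllib.request import urlopen

# Assumed address of a local etcd member; adjust to your deployment.
METRICS_URL = "http://127.0.0.1:2379/metrics"

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def fetch_metrics(url=METRICS_URL):
    """Download the raw exposition text from the metrics endpoint."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def metric_names(text, prefix="etcd_"):
    """Return the sorted, de-duplicated metric names that start with prefix."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = SAMPLE_LINE.match(line)
        if match and match.group(1).startswith(prefix):
            names.add(match.group(1))
    return sorted(names)


# Made-up sample in the exposition format (values are illustrative only).
SAMPLE = """\
# HELP etcd_wal_last_index_saved The index of the last entry saved by wal
# TYPE etcd_wal_last_index_saved gauge
etcd_wal_last_index_saved 42
etcd_proxy_requests_total{method="GET"} 7
go_goroutines 12
"""

print(metric_names(SAMPLE))
```

Against a running member, `metric_names(fetch_metrics())` would list the live metric names instead of the sample ones.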
The naming of metrics follows the suggested Prometheus best practices. A metric name has an `etcd` or `etcd_debugging` prefix as its namespace and a subsystem prefix (for example `wal` and `etcdserver`).
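The naming scheme can be illustrated with a small sketch that splits a full metric name into its namespace, subsystem, and remaining name. The function name is hypothetical, and the sketch assumes the subsystem is the single underscore-delimited token right after the namespace.

```python
def split_metric_name(full_name):
    """Split an etcd metric name into (namespace, subsystem, name).

    Assumes the subsystem is the single token right after the namespace.
    """
    # Check the longer namespace first, since it also starts with "etcd_".
    for namespace in ("etcd_debugging", "etcd"):
        prefix = namespace + "_"
        if full_name.startswith(prefix):
            rest = full_name[len(prefix):]
            subsystem, _, name = rest.partition("_")
            return namespace, subsystem, name
    raise ValueError(f"not an etcd metric: {full_name}")


print(split_metric_name("etcd_wal_fsync_duration_seconds"))
```

Names outside the two etcd namespaces, such as those under `go` or `process`, raise `ValueError` in this sketch.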
The metrics under the `etcd` prefix are for monitoring and alerting. They are stable, high-level metrics. Any change to these metrics will be noted in the release notes.
These metrics describe the serving of requests (non-watch events) by etcd members in non-proxy mode: total incoming requests, request failures, and processing latency (including raft rounds for storage). They are useful for tracking user-generated traffic hitting the etcd cluster.
All these metrics are prefixed with `etcd_http_`.
Example Prometheus queries that may be useful from these metrics (across all etcd members):
sum(rate(etcd_http_failed_total{job="etcd"}[1m])) by (method) / sum(rate(etcd_http_events_received_total{job="etcd"}[1m])) by (method)
Shows the fraction of events that failed by HTTP method across all members, across a time window of `1m`.
Show the 0.90-tile latency (in seconds) of read/write (respectively) event handling across all members, with a window of `5m`.
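To make the arithmetic behind the failure-fraction query above concrete, here is a simplified sketch that computes the same ratio from two counter readings per HTTP method. The function names and input shapes are hypothetical, and the sketch ignores the counter-reset handling and extrapolation that Prometheus' `rate()` performs.

```python
def counter_rate(first, last, seconds):
    """Average per-second increase of a counter between two samples."""
    return (last - first) / seconds


def failure_fraction(failed, received, seconds):
    """Per-method failed/received ratio.

    failed and received map an HTTP method to a (first, last) pair of
    counter readings taken `seconds` apart.
    """
    fractions = {}
    for method, (r_first, r_last) in received.items():
        received_rate = counter_rate(r_first, r_last, seconds)
        if received_rate == 0:
            continue  # no traffic for this method in the window
        f_first, f_last = failed.get(method, (0, 0))
        fractions[method] = counter_rate(f_first, f_last, seconds) / received_rate
    return fractions


# Illustrative readings taken 60 seconds apart (made-up numbers).
result = failure_fraction(
    failed={"GET": (2, 5)},
    received={"GET": (100, 160), "PUT": (10, 10)},
    seconds=60,
)
print(result)
```

In this example 3 of the 60 `GET` events in the window failed, so the fraction for `GET` is 0.05; `PUT` saw no traffic and is omitted.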
proxy
etcd members operating in proxy mode do not directly perform store operations. They forward all requests to cluster instances.
Tracking the rate of requests coming from a proxy allows one to pin down which machine is performing most reads/writes.
All these metrics are prefixed with `etcd_proxy_`.
Name | Description | Type |
---|---|---|
requests_total | Total number of requests by this proxy instance. | Counter(method) |
handled_total | Total number of fully handled requests, with responses from etcd members. | Counter(method) |
dropped_total | Total number of dropped requests due to forwarding errors to etcd members. | Counter(method,error) |
handling_duration_seconds | Bucketed handling times by HTTP method, including round trip to member instances. | Histogram(method) |
Example Prometheus queries that may be useful from these metrics (across all etcd servers):
sum(rate(etcd_proxy_handled_total{job="etcd"}[1m])) by (method)
Rate of requests (by HTTP method) handled by all proxies, across a window of `1m`.
histogram_quantile(0.9, sum(rate(etcd_proxy_handling_duration_seconds_bucket{job="etcd",method="GET"}[5m])) by (le))
histogram_quantile(0.9, sum(rate(etcd_proxy_handling_duration_seconds_bucket{job="etcd",method!="GET"}[5m])) by (le))
Show the 0.90-tile latency (in seconds) of handling of user requests across all proxy machines, with a window of `5m`.
sum(rate(etcd_proxy_dropped_total{job="etcd"}[1m])) by (proxying_error)
Rate of failed requests on the proxy. This should be 0; spikes here indicate connectivity issues to the etcd cluster.
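The `histogram_quantile(0.9, ...)` queries above estimate a quantile from cumulative histogram buckets. The sketch below shows roughly how such an estimate can be computed by linear interpolation inside the bucket that contains the target rank. It is a simplified approximation of the idea, not Prometheus' exact implementation, and it assumes `0 < q <= 1` and at least one finite bucket. The bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Illustrative cumulative buckets: (upper bound in seconds, count).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.9, BUCKETS))
```

Because the estimate interpolates within a bucket, its precision depends on how finely the bucket boundaries are chosen around the latencies of interest.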
Proposal duration (`proposal_duration_seconds`) provides a proposal commit latency histogram. The reported latency reflects network and disk IO delays in etcd.
Proposals pending (`proposals_pending`) indicates how many proposals are queued for commit. Rising pending proposals suggest that there is a high client load or that the cluster is unstable.
Failed proposals (`proposals_failed_total`) are normally related to two issues: temporary failures related to a leader election, or longer downtime caused by a loss of quorum in the cluster.
wal
Name | Description | Type |
---|---|---|
fsync_duration_seconds | The latency distributions of fsync called by wal | Histogram |
last_index_saved | The index of the last entry saved by wal | Gauge |
Abnormally high fsync duration (`fsync_duration_seconds`) indicates disk issues and might cause the cluster to be unstable.
Abnormally high snapshot duration (`snapshot_save_total_duration_seconds`) indicates disk issues and might cause the cluster to be unstable.
rafthttp
Name | Description | Type | Labels |
---|---|---|---|
message_sent_latency_seconds | The latency distributions of messages sent | HistogramVec | sendingType, msgType, remoteID |
message_sent_failed_total | The total number of failed messages sent | Counter | sendingType, msgType, remoteID |
Abnormally high message duration (`message_sent_latency_seconds`) indicates network issues and might cause the cluster to be unstable.
An increase in message failures (`message_sent_failed_total`) indicates more severe network issues and might cause the cluster to be unstable.
Label `sendingType` is the connection type used to send messages. `message`, `msgapp` and `msgappv2` use HTTP streaming, while `pipeline` issues an HTTP request for each message.
Label `msgType` is the type of raft message. `MsgApp` is log replication messages; `MsgSnap` is snapshot install messages; `MsgProp` is proposal forward messages; the others maintain internal raft status. Given large snapshots, a lengthy `MsgSnap` transmission latency should be expected. For other types of messages, given enough network bandwidth, latencies comparable to ping latency should be expected.
Label `remoteID` is the member ID of the message destination.
The Prometheus client library provides a number of metrics under the `go` and `process` namespaces. There are a few that are particularly interesting.