Monitor the DM Cluster

2019-06-29 10:12:59

DM monitoring metrics

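The overview section shows a subset of the monitoring metrics for all DM-worker instances that run the currently selected task. The current default alert rules apply only to a single DM-worker instance.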

task state

Metric name	Description	Alert
task state	State of the replication subtask	Alert when the subtask has been in the paused state for more than 10 minutes
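
The alert condition above (paused for more than 10 minutes) can be illustrated with a minimal sketch. It assumes the task state has already been collected as time-ordered `(timestamp, state)` samples; the `"paused"` label, the sample layout, and the function name `paused_too_long` are hypothetical and not taken from DM itself.

```python
from datetime import datetime, timedelta

PAUSED = "paused"  # assumed label; DM may encode the task state differently
THRESHOLD = timedelta(minutes=10)


def paused_too_long(samples, threshold=THRESHOLD):
    """Return True if the latest unbroken paused run is longer than threshold.

    samples: list of (timestamp, state) tuples, sorted oldest first.
    """
    if not samples or samples[-1][1] != PAUSED:
        return False
    run_start = samples[-1][0]
    # Walk backwards to find where the current paused run began.
    for ts, state in reversed(samples):
        if state != PAUSED:
            break
        run_start = ts
    return samples[-1][0] - run_start > threshold
```

A run of exactly 10 minutes does not trigger in this sketch, since the rule says "more than 10 minutes". Any non-paused sample resets the run.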

Dumper

The following metrics have values only when task-mode is full or all.

Metric name	Description	Alert
dump process exits with error	The dumper encountered an error inside DM-worker and exited	Alert immediately

Binlog replication

The following metrics have values only when task-mode is incremental or all.

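Metric name	Description	Alert
remaining time to sync	Estimated time the syncer still needs to fully catch up with the master, in minutes	N/A
replicate lag	Binlog replication delay from the master to the syncer, in seconds	N/A
process exist with error	The binlog replication process encountered an error inside DM-worker and exited	Alert immediately
binlog file gap between master and syncer	Number of binlog files by which the syncer lags behind the upstream master	Alert when the gap is greater than 1 file (a gap of exactly 1 does not trigger) and this persists for 10 minutes
binlog file gap between relay and syncer	Number of binlog files by which the syncer lags behind the relay	Alert when the gap is greater than 1 file (a gap of exactly 1 does not trigger) and this persists for 10 minutes
binlog event qps	Number of binlog events received per unit of time, excluding events that need to be skipped	N/A
skipped binlog event qps	Number of binlog events received per unit of time that need to be skipped	N/A
cost of binlog event transform	Time the syncer takes to parse binlog events and transform them into SQL statements, in seconds	N/A
total sqls jobs	Number of new jobs added per unit of time	N/A
finished sqls jobs	Number of jobs completed per unit of time	N/A
execution latency	Time the syncer takes to execute a transaction on the downstream, in seconds	N/A
unsynced tables	Number of sharded tables in the current subtask that have not yet received the shard DDL	N/A
shard lock resolving	Whether the current subtask is waiting for shard DDL synchronization; a value greater than 0 means it is waiting	N/A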
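
The binlog file gap rule in the table above can be sketched as follows. This is only an illustration under stated assumptions: binlog file names are assumed to end in a numeric sequence suffix (for example `mysql-bin.000007`), the gap is assumed to be the difference between those sequence numbers, and one gap sample is assumed to be taken per minute. The helper names `binlog_seq`, `file_gap`, and `gap_alert` are hypothetical and are not part of DM.

```python
def binlog_seq(filename):
    """Extract the numeric suffix from a name such as 'mysql-bin.000007'."""
    return int(filename.rsplit(".", 1)[1])


def file_gap(ahead_file, syncer_file):
    """Number of binlog files the syncer is behind (assumed: suffix difference)."""
    return binlog_seq(ahead_file) - binlog_seq(syncer_file)


def gap_alert(gaps_per_minute, max_gap=1, window=10):
    """Return True if the last `window` per-minute samples all exceed max_gap.

    With one sample per minute (an assumption), 10 consecutive samples
    approximate the "persists for 10 minutes" part of the rule.
    """
    recent = gaps_per_minute[-window:]
    return len(recent) == window and all(g > max_gap for g in recent)
```

For example, `file_gap("mysql-bin.000007", "mysql-bin.000004")` gives 3 under these assumptions, and a sample history in which the gap stays at 2 or more for ten consecutive minutes would trigger the alert, while a gap of exactly 1 would not.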

Instance

In the Grafana dashboard, the default name of the instance is [...].

task

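Metric name	Description	Alert
task state	State of the replication subtask	Alert when the subtask has been in the paused state for more than 10 minutes
load progress	Progress of the loader import as a percentage, ranging from 0% to 100%	N/A
binlog file gap between master and syncer	Number of binlog files by which binlog replication lags behind the upstream master	N/A
shard lock resolving	Whether the current subtask is waiting for shard DDL synchronization; a value greater than 0 means it is waiting	N/A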
