Below are some typical application scenarios.
System is running slowly
When the system is running slowly, we want to know the system's running status in as much detail as possible, such as:
- JVM: Is there a full GC? How long does it take? How much does memory usage decrease after GC? Are there many threads?
- System: Is the CPU usage too high? Are there many disk I/Os?
- Connections: How many connections are there right now?
- Interface: What are the TPS and latency of each interface?
- Thread Pool: Are there many pending tasks?
- Cache Hit Ratio
No space left on device
When we meet a "no space left on device" error, we really want to know which kind of data file grew rapidly in the past hours.
Is the system running in an abnormal state?
We can use the count of error logs, the alive status of nodes in the cluster, etc., to determine whether the system is running abnormally.
Anyone who cares about the system's status, including but not limited to RD, QA, SRE, and DBA, can use these metrics to work more efficiently.
For now, we provide metrics for several core modules of IoTDB; more metrics will be added or updated along with the development of new features and the optimization or refactoring of the architecture.
Before stepping into the next section, let's take a look at some key concepts about metrics.
Metric Name
The name of the metric; for example, `logback_events_total` indicates the total count of log events.
Tag
Each metric can have zero or more sub-classes (tags). For the same example, the `logback_events_total` metric has a tag named `level`, which means the total count of log events at a specific level.
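For example, in the Prometheus text format the metric name and its tags appear together on each sample line; the lines below reuse the `logback_events_total` metric and its `level` tag from the Logback table later in this document:

```
# metric name: logback_events_total, tag: level
logback_events_total{level="warn",} 0.0
logback_events_total{level="error",} 0.0
```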
4.2. Data Format
IoTDB provides metrics data in both JMX and Prometheus formats. For JMX, you can get these metrics via `org.apache.iotdb.metrics`.
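If you just want to browse the JMX metrics interactively, any JMX client will do; for example (the address below is a placeholder for your IoTDB host and JMX port):

```
# Open a JMX client and look for MBeans under the org.apache.iotdb.metrics domain
jconsole <iotdb_host>:<jmx_port>
```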
Next, we will use Prometheus-format data as samples to describe each kind of metric.
4.3.1. API
4.3.2. Task
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| queue | name="compaction_inner/compaction_cross/flush", status="running/waiting" | important | The count of current tasks in running and waiting status | queue{name="flush",status="waiting",} 0.0 queue{name="flush",status="running",} 0.0 |
| cost_task_seconds_count | name="compaction/flush" | important | The total count of tasks that have occurred so far | cost_task_seconds_count{name="flush",} 1.0 |
| cost_task_seconds_max | name="compaction/flush" | important | The duration in seconds of the longest task so far | cost_task_seconds_max{name="flush",} 0.363 |
| cost_task_seconds_sum | name="compaction/flush" | important | The total cost in seconds of all tasks so far | cost_task_seconds_sum{name="flush",} 0.363 |
| data_written | name="compaction", type="aligned/not-aligned/total" | important | The size of data written during compaction | data_written{name="compaction",type="total",} 10240 |
| data_read | name="compaction" | important | The size of data read during compaction | data_read{name="compaction",} 10240 |
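The `_count`, `_sum`, and `_max` series above follow the usual count/sum/max naming pattern, so derived figures such as the average flush cost can be computed in Prometheus with a query like the following (a sketch; the time window and tag filters are only illustrative):

```
# Average duration of a flush task over the last 5 minutes (illustrative window)
rate(cost_task_seconds_sum{name="flush"}[5m])
  / rate(cost_task_seconds_count{name="flush"}[5m])
```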
4.3.3. Memory Usage
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| mem | name="chunkMetaData/storageGroup/mtree" | important | Current memory size of chunkMetaData/storageGroup/mtree data in bytes | mem{name="chunkMetaData",} 2050.0 |
4.3.4. Cache Hit Ratio
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| cache_hit | name="chunk/timeSeriesMeta/bloomFilter" | important | Cache hit ratio of chunk/timeSeriesMeta and prevention ratio of the bloom filter (%) | cache_hit{name="chunk",} 80 |
4.3.5. Business Data
4.3.6. Cluster
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| cluster_node_leader_count | name="" | important | The count of dataGroupLeaders on each node, which reflects the distribution of leaders | cluster_node_leader_count{name="127.0.0.1",} 2.0 |
| cluster_uncommitted_log | name="" | important | The count of uncommitted logs on each node in the data groups it belongs to | cluster_uncommitted_log{name="127.0.0.1_Data-127.0.0.1-40010-raftId-0",} 0.0 |
| cluster_node_status | name="" | important | The current node status: 1=online, 2=offline | cluster_node_status{name="127.0.0.1",} 1.0 |
| cluster_elect_total | name="", status="fail/win" | important | The count and result (win or fail) of elections the node participated in | cluster_elect_total{name="127.0.0.1",status="win",} 1.0 |
4.4. IoTDB PreDefined Metrics Set
Users can modify the value of `predefinedMetrics` in the `iotdb-metric.yml` file to enable predefined sets of metrics; currently `JVM`, `LOGBACK`, `FILE`, `PROCESS`, and `SYSTEM` are supported.
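For example, enabling all of them might look roughly like this in `iotdb-metric.yml` (a sketch; whether the value is written exactly as a YAML list like this may differ between versions, so check the file shipped with your IoTDB):

```yaml
# Predefined metric sets to enable (sketch of the predefinedMetrics key mentioned above)
predefinedMetrics:
  - JVM
  - LOGBACK
  - FILE
  - PROCESS
  - SYSTEM
```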
4.4.1. JVM
4.4.1.1. Threads
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_threads_live_threads | None | Important | The current count of threads | jvm_threads_live_threads 25.0 |
| jvm_threads_daemon_threads | None | Important | The current count of daemon threads | jvm_threads_daemon_threads 12.0 |
| jvm_threads_peak_threads | None | Important | The max count of threads so far | jvm_threads_peak_threads 28.0 |
| jvm_threads_states_threads | state="runnable/blocked/waiting/timed-waiting/new/terminated" | Important | The count of threads in each state | jvm_threads_states_threads{state="runnable",} 10.0 |
4.4.1.2. GC
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_gc_pause_seconds_count | action="end of major GC/end of minor GC", cause="xxxx" | Important | The total count of YGC/FGC events and their causes | jvm_gc_pause_seconds_count{action="end of major GC",cause="Metadata GC Threshold",} 1.0 |
| jvm_gc_pause_seconds_sum | action="end of major GC/end of minor GC", cause="xxxx" | Important | The total cost in seconds of YGC/FGC and their causes | jvm_gc_pause_seconds_sum{action="end of major GC",cause="Metadata GC Threshold",} 0.03 |
| jvm_gc_pause_seconds_max | action="end of major GC", cause="Metadata GC Threshold" | Important | The max cost in seconds of a YGC/FGC so far and its cause | jvm_gc_pause_seconds_max{action="end of major GC",cause="Metadata GC Threshold",} 0.0 |
| jvm_gc_memory_promoted_bytes_total | None | Important | Count of positive increases in the size of the old generation memory pool from before GC to after GC | jvm_gc_memory_promoted_bytes_total 8425512.0 |
| jvm_gc_max_data_size_bytes | None | Important | Max size of the long-lived heap memory pool | jvm_gc_max_data_size_bytes 2.863661056E9 |
| jvm_gc_live_data_size_bytes | None | Important | Size of the long-lived heap memory pool after reclamation | jvm_gc_live_data_size_bytes 8450088.0 |
| jvm_gc_memory_allocated_bytes_total | None | Important | Incremented for an increase in the size of the (young) heap memory pool after one GC and before the next | jvm_gc_memory_allocated_bytes_total 4.2979144E7 |
4.4.1.3. Memory
4.4.1.4. Classes
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_classes_unloaded_classes_total | None | Important | The total number of classes unloaded since the Java virtual machine started execution | jvm_classes_unloaded_classes_total 680.0 |
| jvm_classes_loaded_classes | None | Important | The number of classes currently loaded in the Java virtual machine | jvm_classes_loaded_classes 5975.0 |
| jvm_compilation_time_ms_total | compiler="HotSpot 64-Bit Tiered Compilers" | Important | The approximate accumulated elapsed time spent in compilation | jvm_compilation_time_ms_total{compiler="HotSpot 64-Bit Tiered Compilers",} 107092.0 |
4.4.2. File
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| file_size | name="wal/seq/unseq" | important | The current file size of wal/seq/unseq in bytes | file_size{name="wal",} 67.0 |
| file_count | name="wal/seq/unseq" | important | The current count of wal/seq/unseq files | file_count{name="seq",} 1.0 |
4.4.3. Logback
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| logback_events_total | level="trace/debug/info/warn/error" | Important | The count of trace/debug/info/warn/error log events so far | logback_events_total{level="warn",} 0.0 |
4.4.4. Process
4.4.5. System
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| sys_cpu_load | name="cpu" | core | Current system CPU usage (%) | sys_cpu_load{name="system",} 15.0 |
| sys_cpu_cores | name="cpu" | core | Available CPU cores | sys_cpu_cores{name="system",} 16.0 |
| sys_total_physical_memory_size | name="memory" | core | Maximum physical memory of the system | sys_total_physical_memory_size{name="system",} 1.5950999552E10 |
| sys_free_physical_memory_size | name="memory" | core | Currently available memory of the system | sys_free_physical_memory_size{name="system",} 4.532396032E9 |
| sys_total_swap_space_size | name="memory" | core | Maximum swap space of the system | sys_total_swap_space_size{name="system",} 2.1051273216E10 |
| sys_free_swap_space_size | name="memory" | core | Available swap space of the system | sys_free_swap_space_size{name="system",} 2.931576832E9 |
| sys_committed_vm_size | name="memory" | important | The amount of virtual memory available to running processes | sys_committed_vm_size{name="system",} 5.04344576E8 |
| sys_disk_total_space | name="disk" | core | Total disk space | sys_disk_total_space{name="system",} 5.10770798592E11 |
| sys_disk_free_space | name="disk" | core | Available disk space | sys_disk_free_space{name="system",} 3.63467845632E11 |
- If you want to add your own metrics data in IoTDB, please see the [IoTDB Metric Framework](https://github.com/apache/iotdb/tree/master/metrics) document.
- Metric embedded point definition rules
  - `Metric`: the name of the monitoring item. For example, `entry_seconds_count` is the cumulative number of accesses to the interface, and `file_size` is the total number of files.
- Monitoring indicator level meaning:
  - The default startup level for online operation is `Important`, the default startup level for offline debugging is `Normal`, and the audit strictness is `Core > Important > Normal > All`.
    - `Core`: Core indicators of the system, used by operation and maintenance personnel, which are related to the performance, stability, and security of the system, such as the status of instances, the load of the system, etc.
    - `Important`: Important indicators of a module, used by operation and maintenance personnel and testers, which are directly related to the running status of each module, such as the number of merged files, execution status, etc.
    - `Normal`: General indicators of a module, used by developers to facilitate locating the module when problems occur, such as specific key operation situations during a merge.
    - `All`: All indicators of a module, used by module developers, often needed when a problem is being reproduced, so as to solve it quickly.
The metrics collection switch is disabled by default. You need to enable it in `conf/iotdb-metric.yml`; currently, it also supports hot loading via `load configuration` after startup.
5.1. iotdb-metric.yml
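A minimal sketch of the enable switch in this file is shown below; the key name `enableMetric` is our assumption here, so follow the comments in the file shipped with your version:

```yaml
# Turn metrics collection on (assumed key name; disabled by default).
enableMetric: true
# The predefinedMetrics list shown in section 4.4 also lives in this file.
```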
Then you can get metrics data as follows:
- Enable the metrics switch in `iotdb-metric.yml`
- You can leave the other config parameters at their defaults
- Start/restart your IoTDB server/cluster
- Open your browser or use the `curl` command to request `http://server_ip:9091/metrics`, and you will get metrics data like the following:
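For example (a trimmed sample assembled from the metric samples in the tables above; your actual metric set and values will differ):

```
$ curl http://server_ip:9091/metrics
jvm_threads_live_threads 25.0
logback_events_total{level="warn",} 0.0
file_count{name="seq",} 1.0
file_size{name="wal",} 67.0
queue{name="flush",status="waiting",} 0.0
cost_task_seconds_count{name="flush",} 1.0
...
```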
5.2. Prometheus + Grafana

As described above, IoTDB provides metrics data in the standard Prometheus format, so we can integrate with Prometheus and Grafana directly.
- While running, IoTDB collects its metrics continuously.
- Prometheus scrapes metrics from IoTDB at a configurable interval.
- Prometheus saves these metrics to its internal TSDB.
- Grafana queries metrics from Prometheus at a configurable interval and then presents them on dashboards.
So, we need to do some additional work to configure and deploy Prometheus and Grafana.
For instance, you can configure Prometheus as follows to get metrics data from IoTDB:
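This is a scrape-config sketch; the job name and target address are placeholders, and you should keep your existing global settings:

```yaml
# prometheus.yml (sketch): add a scrape job pointing at the IoTDB metrics endpoint
scrape_configs:
  - job_name: "iotdb"               # placeholder job name
    metrics_path: "/metrics"
    static_configs:
      - targets: ["server_ip:9091"] # replace with your IoTDB host
```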
The following documents may help you have a good journey with Prometheus and Grafana.
- Prometheus getting_started
- Grafana getting_started
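Once Prometheus is scraping IoTDB, Grafana panels are simply PromQL queries over the metric names listed above; for example (the tag values here are only illustrative):

```
# Free disk space on the node (System metric set)
sys_disk_free_space{name="system"}

# Chunk cache hit ratio (%)
cache_hit{name="chunk"}
```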
5.3. Apache IoTDB Dashboard
We provide the Apache IoTDB Dashboard, and the rendering shown in Grafana is as follows:
- You can obtain the JSON files of dashboards corresponding to different IoTDB versions in the `grafana-metrics-example` folder.
- You can visit , search for `Apache IoTDB Dashboard`, and use it.

When creating a Grafana dashboard, you can import the JSON file you just downloaded and select the corresponding target data source for the Apache IoTDB Dashboard.