Below are some typical application scenarios.
System is running slowly
When the system is running slowly, we want to know the system's running status in as much detail as possible, such as:
- JVM: Is there a full GC? How long does it take? How much does memory usage decrease after GC? Are there many threads?
- System: Is the CPU usage too high? Are there many disk I/Os?
- Connections: How many connections are there right now?
- Interface: What are the TPS and latency of each interface?
- Thread Pool: Are there many pending tasks?
- Cache Hit Ratio
No space left on device
When we meet a "no space left on device" error, we really want to know which kind of data file grew rapidly in the past hours.
Is the system running in an abnormal state?
We can use the count of error logs, the alive status of nodes in the cluster, etc., to determine whether the system is running abnormally.
Anyone who cares about the system's status, including but not limited to RD, QA, SRE, and DBA, can use these metrics to work more efficiently.
For now, we provide metrics for several core modules of IoTDB; more metrics will be added or updated along with the development of new features and the optimization or refactoring of the architecture.
Before stepping into the next section, let's take a look at some key concepts about metrics.
Metric Name
The name of the metric; for example, `logback_events_total` indicates the total count of log events.
Tag
Each metric can have zero or more sub-classes (tags). For the same example, the `logback_events_total` metric has a tag named `level`, which means the total count of log events at a specific level.
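For example, in the Prometheus text format the metric name and its tags appear together on each sample line; the lines below reuse the `logback_events_total` metric and its `level` tag from the Logback table later in this document:

```
# metric name: logback_events_total, tag: level
logback_events_total{level="warn",} 0.0
logback_events_total{level="error",} 0.0
```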
4.2. Data Format
IoTDB provides metrics data in both JMX and Prometheus formats. For JMX, you can get these metrics via `org.apache.iotdb.metrics`.
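If you just want to browse the JMX metrics interactively, any JMX client will do; for example (the address below is a placeholder for your IoTDB host and JMX port):

```
# Open a JMX client and look for MBeans under the org.apache.iotdb.metrics domain
jconsole <iotdb_host>:<jmx_port>
```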
Next, we will use Prometheus-format data as samples to describe each kind of metric.
4.3.1. API
4.3.2. Task
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| queue | name="compaction_inner/compaction_cross/flush", status="running/waiting" | important | The count of current tasks in running and waiting status | queue{name="flush",status="waiting",} 0.0 queue{name="flush",status="running",} 0.0 |
| cost_task_seconds_count | name="compaction/flush" | important | The total count of tasks that have occurred so far | cost_task_seconds_count{name="flush",} 1.0 |
| cost_task_seconds_max | name="compaction/flush" | important | The duration in seconds of the longest task so far | cost_task_seconds_max{name="flush",} 0.363 |
| cost_task_seconds_sum | name="compaction/flush" | important | The total cost in seconds of all tasks so far | cost_task_seconds_sum{name="flush",} 0.363 |
| data_written | name="compaction", type="aligned/not-aligned/total" | important | The size of data written during compaction | data_written{name="compaction",type="total",} 10240 |
| data_read | name="compaction" | important | The size of data read during compaction | data_read{name="compaction",} 10240 |
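The `_count`, `_sum`, and `_max` series above follow the usual count/sum/max naming pattern, so derived figures such as the average flush cost can be computed in Prometheus with a query like the following (a sketch; the time window and tag filters are only illustrative):

```
# Average duration of a flush task over the last 5 minutes (illustrative window)
rate(cost_task_seconds_sum{name="flush"}[5m])
  / rate(cost_task_seconds_count{name="flush"}[5m])
```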
4.3.3. Memory Usage
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| mem | name="chunkMetaData/storageGroup/mtree" | important | Current memory size of chunkMetaData/storageGroup/mtree data in bytes | mem{name="chunkMetaData",} 2050.0 |
4.3.4. Cache Hit Ratio
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| cache_hit | name="chunk/timeSeriesMeta/bloomFilter" | important | Cache hit ratio of chunk/timeSeriesMeta and prevention ratio of the bloom filter (%) | cache_hit{name="chunk",} 80 |
4.3.5. Business Data
4.3.6. Cluster
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| cluster_node_leader_count | name="" | important | The count of dataGroupLeaders on each node, which reflects the distribution of leaders | cluster_node_leader_count{name="127.0.0.1",} 2.0 |
| cluster_uncommitted_log | name="" | important | The count of uncommitted logs on each node in the data groups it belongs to | cluster_uncommitted_log{name="127.0.0.1_Data-127.0.0.1-40010-raftId-0",} 0.0 |
| cluster_node_status | name="" | important | The current node status: 1=online, 2=offline | cluster_node_status{name="127.0.0.1",} 1.0 |
| cluster_elect_total | name="", status="fail/win" | important | The count and result (win or fail) of elections the node participated in | cluster_elect_total{name="127.0.0.1",status="win",} 1.0 |
4.4. IoTDB PreDefined Metrics Set
Users can modify the value of `predefinedMetrics` in the `iotdb-metric.yml` file to enable predefined sets of metrics; currently `JVM`, `LOGBACK`, `FILE`, `PROCESS`, and `SYSTEM` are supported.
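For example, enabling all of them might look roughly like this in `iotdb-metric.yml` (a sketch; whether the value is written exactly as a YAML list like this may differ between versions, so check the file shipped with your IoTDB):

```yaml
# Predefined metric sets to enable (sketch of the predefinedMetrics key mentioned above)
predefinedMetrics:
  - JVM
  - LOGBACK
  - FILE
  - PROCESS
  - SYSTEM
```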
4.4.1. JVM
4.4.1.1. Threads
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_threads_live_threads | None | Important | The current count of threads | jvm_threads_live_threads 25.0 |
| jvm_threads_daemon_threads | None | Important | The current count of daemon threads | jvm_threads_daemon_threads 12.0 |
| jvm_threads_peak_threads | None | Important | The max count of threads so far | jvm_threads_peak_threads 28.0 |
| jvm_threads_states_threads | state="runnable/blocked/waiting/timed-waiting/new/terminated" | Important | The count of threads in each state | jvm_threads_states_threads{state="runnable",} 10.0 |
4.4.1.2. GC
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_gc_pause_seconds_count | action="end of major GC/end of minor GC", cause="xxxx" | Important | The total count of YGC/FGC events and their causes | jvm_gc_pause_seconds_count{action="end of major GC",cause="Metadata GC Threshold",} 1.0 |
| jvm_gc_pause_seconds_sum | action="end of major GC/end of minor GC", cause="xxxx" | Important | The total cost in seconds of YGC/FGC and their causes | jvm_gc_pause_seconds_sum{action="end of major GC",cause="Metadata GC Threshold",} 0.03 |
| jvm_gc_pause_seconds_max | action="end of major GC", cause="Metadata GC Threshold" | Important | The max cost in seconds of a YGC/FGC so far and its cause | jvm_gc_pause_seconds_max{action="end of major GC",cause="Metadata GC Threshold",} 0.0 |
| jvm_gc_memory_promoted_bytes_total | None | Important | Count of positive increases in the size of the old generation memory pool from before GC to after GC | jvm_gc_memory_promoted_bytes_total 8425512.0 |
| jvm_gc_max_data_size_bytes | None | Important | Max size of the long-lived heap memory pool | jvm_gc_max_data_size_bytes 2.863661056E9 |
| jvm_gc_live_data_size_bytes | None | Important | Size of the long-lived heap memory pool after reclamation | jvm_gc_live_data_size_bytes 8450088.0 |
| jvm_gc_memory_allocated_bytes_total | None | Important | Incremented for an increase in the size of the (young) heap memory pool after one GC and before the next | jvm_gc_memory_allocated_bytes_total 4.2979144E7 |
4.4.1.3. Memory
4.4.1.4. Classes
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_classes_unloaded_classes_total | None | Important | The total number of classes unloaded since the Java virtual machine started execution | jvm_classes_unloaded_classes_total 680.0 |
| jvm_classes_loaded_classes | None | Important | The number of classes currently loaded in the Java virtual machine | jvm_classes_loaded_classes 5975.0 |
| jvm_compilation_time_ms_total | compiler="HotSpot 64-Bit Tiered Compilers" | Important | The approximate accumulated elapsed time spent in compilation | jvm_compilation_time_ms_total{compiler="HotSpot 64-Bit Tiered Compilers",} 107092.0 |
4.4.2. File
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| file_size | name="wal/seq/unseq" | important | The current file size of wal/seq/unseq in bytes | file_size{name="wal",} 67.0 |
| file_count | name="wal/seq/unseq" | important | The current count of wal/seq/unseq files | file_count{name="seq",} 1.0 |
4.4.3. Logback
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| logback_events_total | level="trace/debug/info/warn/error" | Important | The count of trace/debug/info/warn/error log events so far | logback_events_total{level="warn",} 0.0 |
4.4.4. Process
4.4.5. System
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| sys_cpu_load | name="cpu" | core | Current system CPU usage (%) | sys_cpu_load{name="system",} 15.0 |
| sys_cpu_cores | name="cpu" | core | Available CPU cores | sys_cpu_cores{name="system",} 16.0 |
| sys_total_physical_memory_size | name="memory" | core | Maximum physical memory of the system | sys_total_physical_memory_size{name="system",} 1.5950999552E10 |
| sys_free_physical_memory_size | name="memory" | core | Currently available memory of the system | sys_free_physical_memory_size{name="system",} 4.532396032E9 |
| sys_total_swap_space_size | name="memory" | core | Maximum swap space of the system | sys_total_swap_space_size{name="system",} 2.1051273216E10 |
| sys_free_swap_space_size | name="memory" | core | Available swap space of the system | sys_free_swap_space_size{name="system",} 2.931576832E9 |
| sys_committed_vm_size | name="memory" | important | The amount of virtual memory available to running processes | sys_committed_vm_size{name="system",} 5.04344576E8 |
| sys_disk_total_space | name="disk" | core | Total disk space | sys_disk_total_space{name="system",} 5.10770798592E11 |
| sys_disk_free_space | name="disk" | core | Available disk space | sys_disk_free_space{name="system",} 3.63467845632E11 |
- If you want to add your own metrics data in IoTDB, please see the [IoTDB Metric Framework](https://github.com/apache/iotdb/tree/master/metrics) document.
- Metric embedded point definition rules
  - `Metric`: the name of the monitoring item. For example, `entry_seconds_count` is the cumulative number of accesses to the interface, and `file_size` is the total number of files.
- Monitoring indicator level meaning:
  - The default startup level for online operation is `Important`, the default startup level for offline debugging is `Normal`, and the audit strictness is `Core > Important > Normal > All`.
    - `Core`: Core indicators of the system, used by operation and maintenance personnel, which are related to the performance, stability, and security of the system, such as the status of instances, the load of the system, etc.
    - `Important`: Important indicators of a module, used by operation and maintenance personnel and testers, which are directly related to the running status of each module, such as the number of merged files, execution status, etc.
    - `Normal`: General indicators of a module, used by developers to facilitate locating the module when problems occur, such as specific key operation situations during a merge.
    - `All`: All indicators of a module, used by module developers, often needed when a problem is being reproduced, so as to solve it quickly.
The metrics collection switch is disabled by default. You need to enable it in `conf/iotdb-metric.yml`; currently, it also supports hot loading via `load configuration` after startup.
5.1. iotdb-metric.yml
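A minimal sketch of the enable switch in this file is shown below; the key name `enableMetric` is our assumption here, so follow the comments in the file shipped with your version:

```yaml
# Turn metrics collection on (assumed key name; disabled by default).
enableMetric: true
# The predefinedMetrics list shown in section 4.4 also lives in this file.
```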
Then you can get metrics data as follows:
- Enable the metrics switch in `iotdb-metric.yml`
- You can leave the other config parameters at their defaults
- Start/restart your IoTDB server/cluster
- Open your browser or use the `curl` command to request `http://server_ip:9091/metrics`, and you will get metrics data like the following:
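For example (a trimmed sample assembled from the metric samples in the tables above; your actual metric set and values will differ):

```
$ curl http://server_ip:9091/metrics
jvm_threads_live_threads 25.0
logback_events_total{level="warn",} 0.0
file_count{name="seq",} 1.0
file_size{name="wal",} 67.0
queue{name="flush",status="waiting",} 0.0
cost_task_seconds_count{name="flush",} 1.0
...
```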
5.2. Prometheus + Grafana

As described above, IoTDB provides metrics data in the standard Prometheus format, so we can integrate with Prometheus and Grafana directly.
- While running, IoTDB collects its metrics continuously.
- Prometheus scrapes metrics from IoTDB at a configurable interval.
- Prometheus saves these metrics to its internal TSDB.
- Grafana queries metrics from Prometheus at a configurable interval and then presents them on dashboards.
So, we need to do some additional work to configure and deploy Prometheus and Grafana.
For instance, you can configure Prometheus as follows to get metrics data from IoTDB:
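This is a scrape-config sketch; the job name and target address are placeholders, and you should keep your existing global settings:

```yaml
# prometheus.yml (sketch): add a scrape job pointing at the IoTDB metrics endpoint
scrape_configs:
  - job_name: "iotdb"               # placeholder job name
    metrics_path: "/metrics"
    static_configs:
      - targets: ["server_ip:9091"] # replace with your IoTDB host
```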
The following documents may help you have a good journey with Prometheus and Grafana.
- Prometheus getting_started
- Grafana getting_started
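Once Prometheus is scraping IoTDB, Grafana panels are simply PromQL queries over the metric names listed above; for example (the tag values here are only illustrative):

```
# Free disk space on the node (System metric set)
sys_disk_free_space{name="system"}

# Chunk cache hit ratio (%)
cache_hit{name="chunk"}
```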
5.3. Apache IoTDB Dashboard
We provide the Apache IoTDB Dashboard, and the rendering shown in Grafana is as follows:
- You can obtain the JSON files of dashboards corresponding to different IoTDB versions in the `grafana-metrics-example` folder.
- You can visit , search for `Apache IoTDB Dashboard`, and use it.

When creating a Grafana dashboard, you can import the JSON file you just downloaded and select the corresponding target data source for the Apache IoTDB Dashboard.