InfluxDB storage engine

Data is safely written to disk
Data is accurate (first) and performant (second)

This document outlines the internal workings of the storage engine. This information is presented both as a reference and to aid those looking to maximize performance.

The storage engine includes the following components:

Write Ahead Log (WAL)
Time-Structed Merge Tree (TSM)

The storage engine handles data from the point an API write request is received through writing data to the physical disk. Data is written to InfluxDB using line protocol sent via HTTP POST request to the endpoint. Batches of are sent to InfluxDB, compressed, and written to a WAL for immediate durability. Points are also written to an in-memory cache and become immediately queryable. The in-memory cache is periodically written to disk in the form of TSM files. As TSM files accumulate, the storage engine combines and compacts accumulated them into higher level TSM files.

While points can be sent individually, for efficiency, most applications send points in batches. Points in a POST body can be from an arbitrary number of series, measurements, and tag sets. Points in a batch do not have to be from the same measurement or tagset.

When the storage engine receives a write request, the following steps occur:

The write request is appended to the end of the WAL file.
Data is written to disk using fsync().
When data is successfully written to disk, a response confirms the write request was successful.

takes the file and pushes pending writes all the way to the disk. As a system call, fsync() has a kernel context switch that’s computationally expensive, but guarantees that data is safe on disk.

When the storage engine restarts, the WAL file is read back into the in-memory database. InfluxDB then answers requests to the endpoint.

The cache is an in-memory copy of data points currently stored in the WAL. The cache:

Organizes points by key (measurement, tag set, and unique field) Each field is stored in its own time-ordered range.
Stores uncompressed data.
Gets updates from the WAL each time the storage engine restarts. The cache is queried at runtime and merged with the data stored in TSM files.
Uses a maximum maxSize bytes of memory.

To efficiently compact and store data, the storage engine groups field values by series key, and then orders those field values by time. (A is defined by measurement, tag key and value, and field key.)

The storage engine uses a Time-Structured Merge Tree (TSM) data format. TSM files store compressed series data in a columnar format. To improve efficiency, the storage engine only stores differences (or deltas) between values in a series. Column-oriented storage lets the engine read by series key and omit extraneous data.

After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared. The TSM compaction code is quite complex. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries.

For more information on the TSM engine, watch the video below:

What measurements, tags, fields exist? (This happens in meta queries.)
Given a measurement, tags, and fields, what series keys exist?

internals

Storage engine

InfluxDB storage engine

Related