Optimizing persistent storage

    The following table lists the available persistent storage technologies for OKD.

    Table 1. Available storage options
    Storage type | Description | Examples

    Block

    • Presented to the operating system (OS) as a block device

    • Suitable for applications that need full control of storage and operate at a low level on files bypassing the file system

    • Also referred to as a Storage Area Network (SAN)

    • Non-shareable, which means that only one client at a time can mount an endpoint of this type

    Containerized GlusterFS/External GlusterFS footnoteref:dynamicPV[Containerized GlusterFS/External GlusterFS, Ceph RBD, OpenStack Cinder, AWS EBS, Azure Disk, GCE persistent disk, and VMware vSphere support dynamic persistent volume (PV) provisioning natively in OKD.], iSCSI, Fibre Channel, Ceph RBD, OpenStack Cinder, AWS EBS footnoteref:dynamicPV[], Dell/EMC Scale.IO, VMware vSphere Volume, GCE Persistent Disk footnoteref:dynamicPV[], Azure Disk

    File

    • Presented to the OS as a file system export to be mounted

    • Also referred to as Network Attached Storage (NAS)

    • Concurrency, latency, file locking mechanisms, and other capabilities vary widely between protocols, implementations, vendors, and scales.

    Containerized GlusterFS/External GlusterFS footnoteref:dynamicPV[], RHEL NFS, NetApp NFS footnoteref:netappnfs[NetApp NFS supports dynamic PV provisioning when using the Trident plugin.], Azure File, Vendor NFS, Vendor GlusterFS footnoteref:glusterfs[Vendor GlusterFS, Vendor S3, and Vendor Swift supportability and configurability may vary.], AWS EFS

    Object

    • Accessible through a REST API endpoint

    • Configurable for use in the OKD Registry

    • Applications must build their drivers into the application and/or container.

    Containerized GlusterFS/External GlusterFS footnoteref:dynamicPV[], Ceph Object Storage (RADOS Gateway), OpenStack Swift, Aliyun OSS, AWS S3, Google Cloud Storage, Azure Blob Storage, Vendor S3 footnoteref:glusterfs[], Vendor Swift footnoteref:glusterfs[]

    You can use Containerized GlusterFS (a hyperconverged or cluster-hosted storage solution) or External GlusterFS (an externally hosted storage solution) for block, file, and object storage for the OKD registry, logging, and monitoring.

    The following table summarizes the recommended and configurable storage technologies for the given OKD cluster application.

    Table 2. Recommended and configurable storage technology
    Storage type | RWO footnoteref:rwo[ReadWriteOnce] | ROX footnoteref:rox[ReadOnlyMany] | RWX footnoteref:rwx[ReadWriteMany] | Registry | Scaled registry | Monitoring | Logging | Apps

    Block

    Yes

    Yes footnoteref:disk[This does not apply to physical disk, VM physical disk, VMDK, loopback over NFS, AWS EBS, Azure Disk, and Cinder (the latter for block).]

    No

    Configurable

    Not configurable

    Recommended

    Recommended

    Recommended

    File

    Yes

    Yes footnoteref:disk[]

    Yes

    Configurable

    Configurable

    Configurable footnoteref:metrics-warning[For monitoring components, using file storage with the ReadWriteMany (RWX) access mode is unreliable. If you use file storage, do not configure the RWX access mode on any PersistentVolumeClaims that are configured for use with monitoring.]

    Configurable footnoteref:logging-warning[For logging, using any shared storage would be an anti-pattern. One volume per logging-es is required.]

    Recommended

    Object

    Yes

    Yes

    Yes

    Recommended

    Recommended

    Not configurable

    Not configurable

    Not configurable footnoteref:object[Object storage is not consumed through OKD’s PVs/persistent volume claims (PVCs). Apps must integrate with the object storage REST API. ]

    A scaled registry is an OKD registry where three or more pod replicas are running.
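    As a quick sketch, assuming the default registry runs as the docker-registry deployment config in the default project, scaling it to three replicas makes it a scaled registry in the sense used here:

        $ oc scale dc/docker-registry --replicas=3 -n default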

    Registry

    In a non-scaled/high-availability (HA) OKD registry cluster deployment:

    • The preferred storage technology is object storage followed by block storage. The storage technology does not need to support RWX access mode.

    • While hostPath volumes are configurable for a non-scaled/HA OKD registry, they are not recommended for cluster deployment. A sketch of attaching a persistent volume claim to the registry follows this list.
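    The following is a minimal sketch of the configurable case, assuming the default docker-registry deployment config and its registry-storage volume name in the default project; the claim name registry-storage-claim and the 100Gi size are placeholders. Because the registry is non-scaled, the claim requests only the RWO access mode. Create a file such as registry-pvc.yaml:

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: registry-storage-claim
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi

    Then create the claim and attach it to the registry:

        $ oc create -n default -f registry-pvc.yaml
        $ oc set volume dc/docker-registry -n default --add --overwrite --name=registry-storage -t pvc --claim-name=registry-storage-claim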

    Scaled registry

    In a scaled/HA OKD registry cluster deployment:

    • The preferred storage technology is object storage. The storage technology must support RWX access mode and must ensure read-after-write consistency.

    • File storage and block storage are not recommended for a scaled/HA OKD registry cluster deployment with production workloads.

    • All NAS storage (excluding Containerized GlusterFS/External GlusterFS, which uses an object storage interface) is not recommended for an OKD registry cluster deployment with production workloads.

    Monitoring

    In an OKD hosted monitoring cluster deployment:

    • The preferred storage technology is block storage.

    • If you decide to configure file storage, make sure that it follows POSIX standards.

    Testing shows significant unrecoverable corruption when using NFS, so NFS is not recommended for use.

    Other NFS implementations on the marketplace might not have these issues. Contact the individual NFS implementation vendor for more information on any testing that was possibly completed against these OpenShift core components.
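    As a sketch, assuming the default openshift-monitoring project, you can list the monitoring claims and confirm that none of them request the RWX access mode:

        $ oc get pvc -n openshift-monitoring -o custom-columns=NAME:.metadata.name,ACCESSMODES:.spec.accessModes,STORAGECLASS:.spec.storageClassName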

    Logging

    In an OKD hosted logging cluster deployment:

    • The preferred storage technology is block storage.

    • It is not recommended to use NAS storage (excluding Containerized GlusterFS/External GlusterFS, which uses a block storage interface over iSCSI) for a hosted logging cluster deployment with production workloads.

    Testing shows issues when using the NFS server on RHEL as a storage backend for the container image registry. This includes Elasticsearch for logging storage. Therefore, using NFS to back PVs used by core services is not recommended.

    Other NFS implementations on the marketplace might not have these issues. Contact the individual NFS implementation vendor for more information on any testing that was possibly completed against these OpenShift core components.
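    As a sketch, assuming the default openshift-logging project, you can confirm that each Elasticsearch (logging-es) pod is backed by its own volume rather than by shared storage:

        $ oc get pods -n openshift-logging | grep logging-es
        $ oc get pvc -n openshift-logging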

    Applications

    Application use cases vary from application to application, as described in the following examples:

    • Storage technologies that support dynamic PV provisioning have low mount time latencies and are not tied to nodes, which supports a healthy cluster.

    • Application developers are responsible for knowing and understanding the storage requirements for their application, and how it works with the provided storage to ensure that issues do not occur when an application scales or interacts with the storage layer.

    • OKD Internal etcd: For the best etcd reliability, the lowest consistent latency storage technology is preferable.

    • Databases: Databases (RDBMSs, NoSQL DBs, and so on) tend to perform best with dedicated block storage; see the sketch after this list.
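    As a sketch of the dynamic provisioning and block storage points above, first list the available storage classes, then request a dedicated block-backed volume for the database through one of them. The claim name database-storage, the class name gp2, and the 20Gi size are placeholders; save the claim to a file and pass it to oc create -f as for the registry claim above:

        $ oc get storageclass

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: database-storage
        spec:
          storageClassName: gp2
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi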

    Container runtimes store images and containers in a graph driver (a pluggable storage technology), such as DeviceMapper and OverlayFS. Each has advantages and disadvantages.

    For more information about OverlayFS, including supportability and usage caveats, see the Red Hat Enterprise Linux documentation for your version.

    Table 3. Graph driver comparisons
    Name | Description | Benefits | Limitations

    OverlayFS

    • overlay

    • overlay2

    Combines a lower (parent) and upper (child) filesystem and a working directory (on the same filesystem as the child). The lower filesystem is the base image, and when you create new containers, a new upper filesystem is created containing the deltas.

    • Faster than Device Mapper at starting and stopping containers. The startup time difference between Device Mapper and Overlay is generally less than one second.

    • Allows for page cache sharing.

    Not POSIX compliant.

    Device Mapper Thin Provisioning

    Uses LVM, Device Mapper, and the dm-thinp kernel module. It differs from loop-lvm by removing the loopback device and talking straight to a raw partition (no file system).

    • There are measurable performance advantages at moderate load and high density.

    • It gives you per-container limits for capacity (10G by default).

    • You have to have a dedicated partition for it.

    • It is not set up by default in Red Hat Enterprise Linux (RHEL).

    • All containers and images share the same pool of capacity. It cannot be resized without destroying and re-creating the pool.

    Device Mapper loop-lvm

    Uses the Device Mapper thin provisioning module (dm-thin-pool) to implement copy-on-write (CoW) snapshots. For each Device Mapper graph location, a thin pool is created based on two block devices, one for data and one for metadata. By default, these block devices are created automatically by using loopback mounts of automatically created sparse files.

    It works out of the box, so it is useful for prototyping and development purposes.

    • Not all Portable Operating System Interface for Unix (POSIX) features work (for example, O_DIRECT). Most importantly, this mode is unsupported for production workloads.

    • All containers and images share the same pool of capacity. It cannot be resized without destroying and re-creating the pool.

    For better performance, Red Hat strongly recommends using the OverlayFS storage driver over Device Mapper. However, if you are already using Device Mapper in a production environment, Red Hat strongly recommends using thin provisioning for container images and container root file systems. Otherwise, always use overlay2 for the Docker engine or OverlayFS for CRI-O.

    Using a loop device can affect performance. While you can continue to use it, docker logs a warning message.

    To ease storage configuration, use the docker-storage-setup utility, which automates many of the configuration details:

    For Overlay

    1. Edit the /etc/sysconfig/docker-storage-setup file to specify the device driver:

        STORAGE_DRIVER=overlay2

        With OverlayFS, if you want to have imagefs on a different logical volume, then you must set CONTAINER_ROOT_LV_NAME and CONTAINER_ROOT_LV_MOUNT_PATH. Setting CONTAINER_ROOT_LV_MOUNT_PATH requires CONTAINER_ROOT_LV_NAME to be set. For example, CONTAINER_ROOT_LV_NAME="container-root-lv". See the docker-storage-setup documentation for more information.

      2. If you had a separate disk drive dedicated to docker storage (for example, /dev/xvdb), add the following to the /etc/sysconfig/docker-storage-setup file:

        DEVS=/dev/xvdb
        VG=docker_vg
      3. Restart the docker-storage-setup service:

        # systemctl restart docker-storage-setup

      4. To verify that docker is using overlay2, and to monitor disk space use, run the docker info command:

        # docker info | egrep -i 'storage|pool|space|filesystem'
        Storage Driver: overlay2 (1)
        Backing Filesystem: extfs

        (1) The docker info output when using overlay2.

        OverlayFS is also supported for container runtime use cases as of Red Hat Enterprise Linux 7.2, and provides faster startup time and page cache sharing, which can potentially improve density by reducing overall memory utilization.

      For Thinpool

      1. Edit the /etc/sysconfig/docker-storage-setup file to specify the device driver:

        STORAGE_DRIVER=devicemapper
      2. If you had a separate disk drive dedicated to docker storage (for example, /dev/xvdb), add the following to the /etc/sysconfig/docker-storage-setup file:

        DEVS=/dev/xvdb
        VG=docker_vg

      3. Restart the docker-storage-setup service:

        # systemctl restart docker-storage-setup

        After the restart, docker-storage-setup sets up a volume group named docker_vg and creates a thin-pool logical volume. Documentation for thin provisioning on RHEL is available in the LVM Administrator Guide. View the newly created volumes with the lsblk command:

        # lsblk /dev/xvdb
        NAME                              MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
        └─xvdb1                           202:17   0  10G  0 part
          ├─docker_vg-docker--pool_tmeta  253:0    0  12M  0 lvm
          │ └─docker_vg-docker--pool      253:2    0 6.9G  0 lvm
          └─docker_vg-docker--pool_tdata  253:1    0 6.9G  0 lvm
            └─docker_vg-docker--pool      253:2    0 6.9G  0 lvm

        Thin-provisioned volumes are not mounted and have no file system (individual containers do have an XFS file system), thus they do not show up in df output.

      4. To verify that docker is using an LVM thinpool, and to monitor disk space use, run the docker info command:

        (1) The docker info output when using devicemapper.
        (2) Corresponds to the VG you specified in /etc/sysconfig/docker-storage-setup.

      By default, a thin pool is configured to use 40% of the underlying block device. As you use the storage, LVM automatically extends the thin pool up to 100%. This is why the Data Space Total value does not match the full size of the underlying LVM device.
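      If you want to watch this behavior, you can inspect the data and metadata usage of the pool directly with LVM; this assumes the docker_vg volume group created by docker-storage-setup above:

        # lvs -o lv_name,lv_size,data_percent,metadata_percent docker_vg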

      In development, docker in Red Hat distributions defaults to a loopback mounted sparse file. To see if your system is using the loopback mode:

      # docker info | grep loop0
      Data file: /dev/loop0

      The main advantage of the OverlayFS graph is Linux page cache sharing among containers that share an image on the same node. This attribute of OverlayFS leads to reduced input/output (I/O) during container startup (and, thus, faster container startup time by several hundred milliseconds), as well as reduced memory usage when similar images are running on a node. Both of these results are beneficial in many environments, especially those that aim to optimize for density and have a high container churn rate (such as a build farm), or those that have significant overlap in image content.

      Page cache sharing is not possible with DeviceMapper because thin-provisioned devices are allocated on a per-container basis.

      OverlayFS is a type of union file system. It allows you to overlay one file system on top of another. Changes are recorded in the upper file system, while the lower file system remains unmodified. This allows multiple users to share a file-system image, such as a container or a DVD-ROM, where the base image is on read-only media.

      OverlayFS layers two directories on a single Linux host and presents them as a single directory. These directories are called layers, and the unification process is referred to as a union mount.
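      The same mechanism can be demonstrated outside of a container runtime with a plain overlay mount; the /tmp/overlay-demo paths below are arbitrary. A file written through the merged directory lands in the upper layer only, while the lower layer remains untouched, which is exactly how image layers stay read-only beneath a container's writable layer:

      # mkdir -p /tmp/overlay-demo/{lower,upper,work,merged}
      # echo "from the lower layer" > /tmp/overlay-demo/lower/example.txt
      # mount -t overlay overlay -o lowerdir=/tmp/overlay-demo/lower,upperdir=/tmp/overlay-demo/upper,workdir=/tmp/overlay-demo/work /tmp/overlay-demo/merged
      # echo "changed in the upper layer" > /tmp/overlay-demo/merged/example.txt
      # cat /tmp/overlay-demo/lower/example.txt
      from the lower layer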

      OverlayFS uses one of two graph drivers, overlay or overlay2. As of Red Hat Enterprise Linux 7.2, overlay became supported. As of Red Hat Enterprise Linux 7.4, overlay2 became supported. SELinux on the docker daemon became supported in Red Hat Enterprise Linux 7.4. See the Red Hat Enterprise Linux documentation for information on using OverlayFS with your version of RHEL, including supportability and usage caveats.

      The overlay2 driver natively supports up to 128 lower OverlayFS layers, but the overlay driver works only with a single lower OverlayFS layer. Because of this capability, the overlay2 driver provides better performance for layer-related Docker commands, such as docker build, and consumes fewer inodes on the backing filesystem.

      Because the overlay driver works with a single lower OverlayFS layer, you cannot implement multi-layered images as multiple OverlayFS layers. Instead, each image layer is implemented as its own directory under /var/lib/docker/overlay. Hard links are then used as a space-efficient way to reference data shared with lower layers.

      Docker recommends using the overlay2 driver with OverlayFS rather than the overlay driver, because it is more efficient in terms of inode utilization.
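      To see which driver is in use and how many inodes the backing filesystem is consuming, you can combine the docker info output shown earlier with df:

      # docker info | grep -i 'storage driver'
      # df -i /var/lib/docker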

