Scaling Cluster Metrics

This topic provides information for scaling the metrics components.

Run metrics pods on dedicated OKD infrastructure nodes.
Keep the METRICS_RESOLUTION=30 parameter in OKD metrics deployments. Using a value lower than the default value of 30 for METRICS_RESOLUTION is not recommended. When using the Ansible metrics installation procedure, this is the openshift_metrics_resolution parameter.
Closely monitor the OKD nodes that host metrics pods to detect early capacity shortages (CPU and memory) on the host system. These capacity shortages can cause problems for metrics pods.
In OKD version 3.7 testing, up to 25,000 pods were monitored in a single OKD cluster.

In tests performed with 210 and 990 OKD nodes, where 10,500 and 11,000 pods were monitored respectively, the Cassandra database grew at the rate shown in the table below:

In the above calculation, approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.

If the METRICS_DURATION and METRICS_RESOLUTION values are kept at the default (7 days and 15 seconds respectively), it is safe to plan Cassandra storage size requirements for a week, as in the values above.
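The overhead rule described above can be sketched as a small shell helper. This is only an illustration of the arithmetic: the function name plan_storage and the input value are hypothetical, and the size you pass in should come from your own measurements for the retention period.

```shell
# plan_storage EXPECTED [OVERHEAD_PCT]
# Prints EXPECTED plus the overhead percentage, rounded up to the next
# whole unit. The result is in the same unit as EXPECTED (for example MiB).
plan_storage() {
  expected="$1"
  overhead_pct="${2:-20}"   # roughly 20 percent, as described in the text
  echo $(( (expected * (100 + overhead_pct) + 99) / 100 ))
}

# Hypothetical input, not a measured value: 1000 MiB expected for one week.
plan_storage 1000 20   # prints 1200
```

Integer arithmetic is used so that the helper runs in any POSIX shell; the rounding always goes up so the planned size never falls below the expected size plus overhead.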

One set of metrics pods (Cassandra/Hawkular/Heapster) is able to monitor at least 25,000 pods.

Scaling the Cassandra Components

Cassandra nodes use persistent storage. Therefore, scaling up or down is not possible with replication controllers.

Scaling a Cassandra cluster requires modifying the openshift_metrics_cassandra_replicas variable and re-running the deployment. By default, the Cassandra cluster is a single-node cluster. To deploy more nodes, first provision storage for each new node if the Cassandra storage type is pv, then increase the openshift_metrics_cassandra_replicas value.
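As a rough sketch, the re-run could be wrapped as shown here. The function name, the inventory path, and the playbook path are hypothetical placeholders, since the metrics playbook location depends on your openshift-ansible version. Passing the variable with -e is an assumption made for brevity; setting it in the inventory file, as described above, is equivalent for this purpose.

```shell
# deploy_metrics_with_replicas INVENTORY PLAYBOOK REPLICAS
# Re-runs the metrics deployment with the requested Cassandra node count.
# Set ANSIBLE_PLAYBOOK=echo to print the arguments instead of running them.
deploy_metrics_with_replicas() {
  inventory="$1"   # path to your inventory file (placeholder)
  playbook="$2"    # path to the metrics playbook (placeholder)
  replicas="$3"    # desired number of Cassandra nodes
  "${ANSIBLE_PLAYBOOK:-ansible-playbook}" -i "$inventory" "$playbook" \
    -e "openshift_metrics_cassandra_replicas=$replicas"
}

# Example call with placeholder paths:
# deploy_metrics_with_replicas /path/to/inventory /path/to/metrics-playbook.yml 2
```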

To scale up the number of OKD metrics hawkular pods to two replicas, run:
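A minimal sketch of the scale-up command follows. It assumes the metrics components run in the openshift-infra project and that the replication controller is named hawkular-metrics; both are common defaults but may differ in your deployment. The wrapper function name is hypothetical.

```shell
# scale_hawkular_metrics REPLICAS
# Scales the Hawkular Metrics replication controller.
# Assumed names: project "openshift-infra", rc "hawkular-metrics".
scale_hawkular_metrics() {
  replicas="$1"
  oc scale -n "${METRICS_NAMESPACE:-openshift-infra}" \
    --replicas="$replicas" rc hawkular-metrics
}

# Scale to two replicas:
# scale_hawkular_metrics 2
```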

Alternatively, update your inventory file and re-run the deployment.

To scale down:

Once the previous command succeeds, scale down the rc for the Cassandra instance to 0.

This will remove the Cassandra pod.
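The scale-down sequence might look like the following sketch. It assumes that the step preceding the rc scale-down is a Cassandra decommission, which is the standard way to retire a Cassandra node (nodetool decommission streams the node's data to the remaining nodes). The function name, the pod name, the rc name, and the openshift-infra project are placeholders or assumptions; substitute the values from your cluster.

```shell
# scale_down_cassandra POD RC
# Step 1 (assumed): decommission the Cassandra node running in POD.
# Step 2: scale the replication controller RC for that instance to 0.
scale_down_cassandra() {
  pod="$1"   # Cassandra pod to retire (placeholder)
  rc="$2"    # replication controller for that Cassandra instance (placeholder)
  ns="${METRICS_NAMESPACE:-openshift-infra}"   # assumed project name

  # Stop here if the decommission fails, so the pod is not removed early.
  oc exec -n "$ns" "$pod" -- nodetool decommission || return 1

  oc scale -n "$ns" --replicas=0 rc "$rc"
}

# Example call with placeholder names:
# scale_down_cassandra <hawkular-cassandra-pod> <hawkular-cassandra-rc>
```

Running the decommission first matters because removing the pod before its data has moved to other nodes could leave the cluster without that data.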