Understanding OKD update duration

    The following factors can affect your cluster update duration:

    • The number of nodes in the cluster

    • The health of the cluster nodes

    In OKD, the cluster update happens in two phases:

    • Cluster Version Operator (CVO) target update payload deployment

    • Machine Config Operator (MCO) node updates

    The Cluster Version Operator (CVO) retrieves the target update release image and applies it to the cluster. All components that run as pods are updated during this phase, whereas the host components are updated by the Machine Config Operator (MCO). This process might take 60 to 120 minutes.

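    Machine Config Operator node updates

    The Machine Config Operator (MCO) applies a new machine configuration to each control plane and compute node. During this process, the MCO performs the following sequential actions on each node:

    1. Cordon and drain the node

    2. Update the operating system (OS)

    3. Reboot the node

    4. Uncordon the node and schedule workloads on it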

    The time to complete this process depends on several factors including the node and infrastructure configuration. This process might take 5 or more minutes to complete per node.

    In addition to the MCO, you should consider the impact of the following parameters:

    • The control plane node update duration is predictable and often shorter than that of compute nodes, because the control plane workloads are tuned for graceful updates and quick drains.

    • You can update the compute nodes in parallel by setting the maxUnavailable field to greater than 1 in the Machine Config Pool (MCP). The MCO cordons the number of nodes specified in maxUnavailable and marks them unavailable for update.

    • Increasing maxUnavailable on the MCP can help the pool update more quickly. However, if maxUnavailable is set too high and several nodes are cordoned simultaneously, workloads guarded by a pod disruption budget (PDB) could fail to drain because no schedulable node can be found to run the replicas. If you increase maxUnavailable for the MCP, ensure that you still have enough schedulable nodes to allow PDB-guarded workloads to drain.

    • Before you begin the update, you must ensure that all the nodes are available. Any unavailable node can significantly increase the update duration because node unavailability affects both maxUnavailable and pod disruption budgets.

      To check the status of nodes from the terminal, run the following command:

      $ oc get nodes

      If the status of the node is NotReady or SchedulingDisabled, then the node is not available and this impacts the update duration.

      You can check the status of nodes from the Administrator perspective in the web console by expanding Compute → Nodes.
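    The availability check described above can also be scripted. The following is a minimal sketch, assuming the node list has been exported as JSON (for example with `oc get nodes -o json`) and parsed into a Python dictionary. The function name `unavailable_nodes` and the sample node names are hypothetical and used only for illustration.

```python
def unavailable_nodes(node_list):
    """Return the names of nodes that are NotReady or SchedulingDisabled.

    `node_list` is assumed to be the parsed JSON of a Kubernetes NodeList.
    """
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # A node is Ready only if its "Ready" condition has status "True".
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        # A cordoned node (SchedulingDisabled) has spec.unschedulable set to true.
        cordoned = node.get("spec", {}).get("unschedulable", False)
        if not ready or cordoned:
            names.append(node["metadata"]["name"])
    return names


# Hypothetical sample data shaped like a NodeList; not real cluster output.
sample = {
    "items": [
        {
            "metadata": {"name": "worker-0"},
            "spec": {},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "worker-1"},
            "spec": {"unschedulable": True},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "worker-2"},
            "spec": {},
            "status": {"conditions": [{"type": "Ready", "status": "False"}]},
        },
    ]
}

print(unavailable_nodes(sample))
```

    An empty result suggests that every node is Ready and schedulable. Any listed node is worth investigating before you start the update.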

    The historical update duration of similar clusters provides the best estimate for future cluster updates. However, if historical data is not available, you can use the following convention to estimate your cluster update time:

    Cluster update time = CVO target update payload deployment time + (number of node update iterations x MCO node update time)

    A node update iteration consists of one or more nodes updated in parallel. The control plane nodes are always updated in parallel with the compute nodes. In addition, one or more compute nodes can be updated in parallel based on the maxUnavailable value.

    For example, to estimate the update time, consider an OKD cluster with three control plane nodes and six compute nodes, where each host takes about 5 minutes to reboot.

    Scenario-1

    When you set maxUnavailable to 1 for both the control plane and compute node machine config pools (MCPs), the six compute nodes update one after another, one node in each iteration.
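    The arithmetic for this scenario can be sketched as follows. This is an illustration under stated assumptions, not an official formula: the estimate adds the CVO phase to the number of node update iterations multiplied by the per-node update time, the CVO phase is taken as 60 minutes (the lower end of the 60 to 120 minute range given earlier), and pools are assumed to update in parallel with each other. The function names are hypothetical.

```python
import math


def node_update_iterations(pools):
    """Count node update iterations.

    `pools` is a list of (node_count, max_unavailable) tuples, one per MCP.
    Pools are assumed to update in parallel, so the slowest pool decides.
    """
    return max(math.ceil(count / max_unavailable) for count, max_unavailable in pools)


def estimate_update_minutes(cvo_minutes, per_node_minutes, pools):
    """Estimate: CVO phase + (iterations x per-node update time)."""
    return cvo_minutes + node_update_iterations(pools) * per_node_minutes


# Scenario-1: 3 control plane nodes and 6 compute nodes, maxUnavailable = 1 for both.
# Assumed inputs: 60-minute CVO phase, 5 minutes per node.
scenario_1 = estimate_update_minutes(60, 5, [(3, 1), (6, 1)])
print(scenario_1)  # 60 + (6 x 5) = 90
```

    With these assumptions, the compute pool needs six iterations and the control plane pool three, so the compute pool sets the pace and the estimate is 90 minutes.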

    Scenario-2