Controlling pod placement on nodes using node affinity rules

    In OKD, node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on the nodes and label selectors specified in pods.

    Node affinity allows a pod to specify an affinity towards a group of nodes it can be placed on. The node does not have control over the placement.

    For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.

    There are two types of node affinity rules: required and preferred.

    Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

    You configure node affinity through the Pod spec file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, and then the scheduler attempts to meet the preferred rule.
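    If you specify both rule types, a spec might look like the following. This is an illustrative sketch only; the label keys and values (disktype, region) are hypothetical and not labels defined elsewhere in this document:

    ```yaml
    # Sketch only: the disktype and region labels are hypothetical examples.
    apiVersion: v1
    kind: Pod
    metadata:
      name: combined-affinity-pod
    spec:
      affinity:
        nodeAffinity:
          # Required: the pod can be scheduled only on nodes labeled disktype=ssd.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          # Preferred: among those nodes, the scheduler favors nodes labeled region=east.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            preference:
              matchExpressions:
              - key: region
                operator: In
                values:
                - east
      containers:
      - name: hello-pod
        image: docker.io/ocpqe/hello-pod
    ```

    The required rule filters candidate nodes first; the preferred rule then influences scoring among the nodes that passed the filter.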

    The following example is a Pod spec with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth and whose value is either e2e-az-North or e2e-az-South:

    Example pod configuration file with a node affinity required rule

      apiVersion: v1
      kind: Pod
      metadata:
        name: with-node-affinity
      spec:
        affinity:
          nodeAffinity: (1)
            requiredDuringSchedulingIgnoredDuringExecution: (2)
              nodeSelectorTerms:
              - matchExpressions:
                - key: e2e-az-NorthSouth (3)
                  operator: In (4)
                  values:
                  - e2e-az-North (3)
                  - e2e-az-South (3)
        containers:
        - name: with-node-affinity
          image: docker.io/ocpqe/hello-pod

    1The stanza to configure node affinity.
    2Defines a required rule.
    3The key/value pair (label) that must be matched to apply the rule.
    4The operator represents the relationship between the label on the node and the set of values in the matchExpressions parameter in the Pod spec. This value can be In, NotIn, Exists, DoesNotExist, Lt, or Gt.
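    As a hedged illustration of the Gt and Lt operators: they compare a single, integer-parsable label value numerically. The label key below (example-cpu-count) is hypothetical:

    ```yaml
    # Sketch only: the example-cpu-count label is a hypothetical numeric label.
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: example-cpu-count
                operator: Gt      # the node's label value must parse as an integer
                values:
                - "3"             # Gt and Lt take exactly one value
    ```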

    The following example is a Pod spec with a preferred rule that a node with a label whose key is e2e-az-EastWest and whose value is either e2e-az-East or e2e-az-West is preferred for the pod:

    Example pod configuration file with a node affinity preferred rule

      apiVersion: v1
      kind: Pod
      metadata:
        name: with-node-affinity
      spec:
        affinity:
          nodeAffinity: (1)
            preferredDuringSchedulingIgnoredDuringExecution: (2)
            - weight: 1 (3)
              preference:
                matchExpressions:
                - key: e2e-az-EastWest (4)
                  operator: In (5)
                  values:
                  - e2e-az-East (4)
                  - e2e-az-West (4)
        containers:
        - name: with-node-affinity
          image: docker.io/ocpqe/hello-pod

    1The stanza to configure node affinity.
    2Defines a preferred rule.
    3Specifies a weight for a preferred rule. The node with the highest weight is preferred.
    4The key/value pair (label) that must be matched to apply the rule.
    5The operator represents the relationship between the label on the node and the set of values in the matchExpressions parameter in the Pod spec. This value can be In, NotIn, Exists, DoesNotExist, Lt, or Gt.

    There is no explicit node anti-affinity concept, but using the NotIn or DoesNotExist operator replicates that behavior.
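    For example, a minimal sketch of anti-affinity behavior using NotIn (the zone label value here is illustrative):

    ```yaml
    # Sketch only: keeps the pod off any node labeled zone=emea.
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: NotIn
                values:
                - emea
    ```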

    Configuring a required node affinity rule

    Required rules must be met before a pod can be scheduled on a node.

    Procedure

    The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler is required to place on the node.

    1. Add a label to a node using the oc label node command:

      $ oc label node node1 e2e-az-name=e2e-az1

      You can alternatively apply the following YAML to add the label:

      kind: Node
      apiVersion: v1
      metadata:
        name: <node_name>
        labels:
          e2e-az-name: e2e-az1
    2. In the Pod spec, use the nodeAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter:

      1. Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require the label to be in the node:

        Example output

          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: e2e-az-name
                      operator: In
                      values:
                      - e2e-az1
                      - e2e-az2
    3. Create the pod:

      $ oc create -f e2e-az2.yaml

    Configuring a preferred node affinity rule

    Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

    Procedure

    The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler tries to place on the node.

    1. Add a label to a node using the oc label node command:

      $ oc label node node1 e2e-az-name=e2e-az3
    2. In the Pod spec, use the nodeAffinity stanza to configure the preferredDuringSchedulingIgnoredDuringExecution parameter:

      1. Specify a weight for the node, as a number from 1 to 100. The node with the highest weight is preferred.

      2. Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node:

          spec:
            affinity:
              nodeAffinity:
                preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 1
                  preference:
                    matchExpressions:
                    - key: e2e-az-name
                      operator: In
                      values:
                      - e2e-az3
      3. Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require the label to be in the node.

    3. Create the pod.

    Sample node affinity rules

    The following examples demonstrate node affinity.

    Node affinity with matching labels

    The following example demonstrates node affinity for a node and pod with matching labels:

    • The Node1 node has the label zone:us:

      $ oc label node node1 zone=us

      You can alternatively apply the following YAML to add the label:

      kind: Node
      apiVersion: v1
      metadata:
        name: <node_name>
        labels:
          zone: us
    • The pod-s1 pod has the zone and us key/value pair under a required node affinity rule:

      $ cat pod-s1.yaml

      Example output

      apiVersion: v1
      kind: Pod
      metadata:
        name: pod-s1
      spec:
        containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: "zone"
                  operator: In
                  values:
                  - us
    • The pod-s1 pod can be scheduled on Node1:

      $ oc get pod -o wide

      Example output

      NAME      READY     STATUS    RESTARTS   AGE       IP        NODE
      pod-s1    1/1       Running   0          4m        IP1       node1

    Node affinity with no matching labels

    The following example demonstrates node affinity for a node and pod without matching labels:

    • The Node1 node has the label zone:emea:

      $ oc label node node1 zone=emea
    • The pod-s1 pod has the zone and us key/value pair under a required node affinity rule:

      $ cat pod-s1.yaml

      Example output

      apiVersion: v1
      kind: Pod
      metadata:
        name: pod-s1
      spec:
        containers:
        - image: "docker.io/ocpqe/hello-pod"
          name: hello-pod
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: "zone"
                  operator: In
                  values:
                  - us
    • The pod-s1 pod cannot be scheduled on Node1:

      $ oc describe pod pod-s1

      Example output

      ...
      Events:
        FirstSeen  LastSeen  Count  From               SubObjectPath  Type     Reason
        ---------  --------  -----  ----               -------------  ----     ------
        1m         33s       8      default-scheduler                 Warning  FailedScheduling  No nodes are available that match all of the following predicates:: MatchNodeSelector (1).

    Controlling where an Operator is installed

    By default, when you install an Operator, OKD installs the Operator pod to one of your worker nodes randomly. However, there might be situations where you want that pod scheduled on a specific node or set of nodes.

    The following examples describe situations where you might want to schedule an Operator pod to a specific node or set of nodes:

    • If an Operator requires a particular platform, such as amd64 or arm64

    • If an Operator requires a particular operating system, such as Linux or Windows

    • If you want Operators that work together scheduled on the same host or on hosts located on the same rack

    • If you want Operators dispersed throughout the infrastructure to avoid downtime due to network or hardware issues

    You can control where an Operator pod is installed by adding node affinity constraints to the Operator’s Subscription object.

    The following examples show how to use node affinity to install an instance of the Custom Metrics Autoscaler Operator to a specific node in the cluster:

    Node affinity example that places the Operator pod on a specific node

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: openshift-custom-metrics-autoscaler-operator
        namespace: openshift-keda
      spec:
        name: my-package
        source: my-operators
        sourceNamespace: operator-registries
        config:
          affinity:
            nodeAffinity: (1)
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                    - ip-10-0-163-94.us-west-2.compute.internal
      ...
    1A node affinity that requires the Operator’s pod to be scheduled on a node named ip-10-0-163-94.us-west-2.compute.internal.
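    A preferred rule can be used in the Subscription config as well. The following is a hedged sketch that reuses the placeholder package, source, and node names from the example above; it steers the Operator pod toward the node without blocking scheduling if that node is unavailable:

    ```yaml
    # Sketch only: package, source, and node names are the same placeholders as above.
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: openshift-custom-metrics-autoscaler-operator
      namespace: openshift-keda
    spec:
      name: my-package
      source: my-operators
      sourceNamespace: operator-registries
      config:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - ip-10-0-163-94.us-west-2.compute.internal
    ```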

    Node affinity example that places the Operator pod on a node with a specific platform

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: openshift-custom-metrics-autoscaler-operator
        namespace: openshift-keda
      spec:
        name: my-package
        source: my-operators
        sourceNamespace: operator-registries
        config:
          affinity:
            nodeAffinity: (1)
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                    - arm64
                  - key: kubernetes.io/os
                    operator: In
                    values:
                    - linux
    1A node affinity that requires the Operator’s pod to be scheduled on a node with the kubernetes.io/arch=arm64 and kubernetes.io/os=linux labels.

    Procedure

    To control the placement of an Operator pod, complete the following steps:

    1. Install the Operator as usual.

    2. If needed, ensure that your nodes are labeled to properly respond to the affinity.

    3. Edit the Operator Subscription object to add an affinity:

        apiVersion: operators.coreos.com/v1alpha1
        kind: Subscription
        metadata:
          name: openshift-custom-metrics-autoscaler-operator
          namespace: openshift-keda
        spec:
          name: my-package
          source: my-operators
          sourceNamespace: operator-registries
          config:
            affinity: (1)
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: kubernetes.io/hostname
                      operator: In
                      values:
                      - ip-10-0-185-229.ec2.internal
        ...

      1The stanza to configure the affinity constraints.

    Verification

    • To ensure that the pod is deployed on the specific node, run the following command:

      $ oc get pods -o wide

      Example output

      custom-metrics-autoscaler-operator-5dcc45d656-bhshg   1/1   Running   0   50s   10.131.0.20   ip-10-0-185-229.ec2.internal   <none>   <none>

    Additional resources