Using sysctls in containers

    Network sysctls are a special category of sysctl. Network sysctls include:

    • System-wide sysctls, for example net.ipv4.ip_local_port_range, that are valid for all networking. You can set these independently for each pod on a node.

    • Interface-specific sysctls, for example net.ipv4.conf.IFNAME.accept_local, that only apply to a specific additional network interface for a given pod. You can set these independently for each additional network configuration. You set these by using a tuning-cni configuration, which takes effect after the network interfaces are created.

    Moreover, only sysctls that are considered safe are on the allowed list by default. A cluster administrator can manually enable other, unsafe sysctls on a node to make them available to users.

    In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available from the _/proc/sys/_ virtual process file system. The parameters cover various subsystems, such as:

    • kernel (common prefix: _kernel._)

    • networking (common prefix: _net._)

    • virtual memory (common prefix: _vm._)

    • MDADM (common prefix: _dev._)
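
    The relationship between a dotted parameter name and its file under /proc/sys/ can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the sketch assumes that no path component contains a dot (an interface name such as eth0.100 would break the simple split).

```python
from pathlib import PurePosixPath

# Root of the virtual file system that exposes kernel parameters.
PROC_SYS = PurePosixPath("/proc/sys")


def sysctl_to_path(name: str) -> str:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return str(PROC_SYS.joinpath(*name.split(".")))


def path_to_sysctl(path: str) -> str:
    """Map a /proc/sys file path back to the dotted sysctl name."""
    relative = PurePosixPath(path).relative_to(PROC_SYS)
    return ".".join(relative.parts)


def subsystem(name: str) -> str:
    """Return the common prefix (subsystem) of a sysctl name."""
    return name.split(".", 1)[0]


if __name__ == "__main__":
    print(sysctl_to_path("kernel.shm_rmid_forced"))
    print(path_to_sysctl("/proc/sys/net/ipv4/tcp_syncookies"))
    print(subsystem("vm.swappiness"))
```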

    More subsystems are described in the Kernel documentation. To get a list of all parameters, run:

      $ sudo sysctl -a

    Namespaced and node-level sysctls

    A number of sysctls are namespaced in the Linux kernel. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.

    The following sysctls are known to be namespaced:

    • _kernel.shm*_

    • _kernel.msg*_

    • _kernel.sem_

    • _fs.mqueue.*_

    Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.
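
    As a rough illustration, the patterns listed above can be applied to a sysctl name with shell-style matching. The function name and the three result labels are hypothetical. The net.* result is deliberately hedged because namespace adoption in that group depends on the kernel version and distributor.

```python
from fnmatch import fnmatchcase

# Patterns taken from the list of sysctls known to be namespaced.
NAMESPACED_PATTERNS = ("kernel.shm*", "kernel.msg*", "kernel.sem", "fs.mqueue.*")
# Most, but not all, net.* sysctls are namespaced.
MOSTLY_NAMESPACED_PATTERNS = ("net.*",)


def classify(name: str) -> str:
    """Classify a sysctl name by the patterns listed in this section."""
    if any(fnmatchcase(name, pattern) for pattern in NAMESPACED_PATTERNS):
        return "namespaced"
    if any(fnmatchcase(name, pattern) for pattern in MOSTLY_NAMESPACED_PATTERNS):
        return "probably namespaced"
    return "not known to be namespaced"


if __name__ == "__main__":
    for example in ("kernel.shmmax", "net.core.somaxconn", "vm.swappiness"):
        print(example, "->", classify(example))
```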

    Sysctls that are not namespaced are called node-level and must be set manually by the cluster administrator, either by means of the underlying Linux distribution of the nodes, such as by modifying the _/etc/sysctl.conf_ file, or by using a daemon set with privileged containers. You can use the Node Tuning Operator to set node-level sysctls.

    Sysctls are grouped into safe and unsafe sysctls.

    For system-wide sysctls to be considered safe, they must be namespaced. A namespaced sysctl ensures there is isolation between namespaces and therefore pods. If you set a sysctl for one pod, it must not do any of the following:

    • Influence any other pod on the node

    • Harm the node health

    • Gain CPU or memory resources outside of the resource limits of a pod

    Being namespaced alone is not sufficient for the sysctl to be considered safe.

    Any sysctl that is not added to the allowed list on OKD is considered unsafe for OKD.

    Unsafe sysctls are not allowed by default. For system-wide sysctls, the cluster administrator must manually enable them on a per-node basis. Pods that request unsafe sysctls that are not enabled on the node are scheduled but do not launch.

    You cannot manually enable interface-specific unsafe sysctls.

    OKD adds the following system-wide and interface-specific safe sysctls to an allowed safe list:

    Table 1. System-wide safe sysctls
    sysctl    Description

    kernel.shm_rmid_forced

    When set to 1, all shared memory objects in the current IPC namespace are automatically forced to use IPC_RMID. For more information, see shm_rmid_forced.

    net.ipv4.ip_local_port_range

    Defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first port number, and the second number is the last local port number. If possible, it is better if these numbers have different parity (one even and one odd value). They must be greater than or equal to ip_unprivileged_port_start. The default values are 32768 and 60999 respectively. For more information, see ip_local_port_range.

    net.ipv4.tcp_syncookies

    When net.ipv4.tcp_syncookies is set, the kernel handles TCP SYN packets normally until the half-open connection queue is full, at which time the SYN cookie functionality kicks in. This functionality allows the system to keep accepting valid connections, even if under a denial-of-service attack. For more information, see tcp_syncookies.

    net.ipv4.ping_group_range

    This restricts ICMP_PROTO datagram sockets to users in the group range. The default is 1 0, meaning that nobody, not even root, can create ping sockets. For more information, see ping_group_range.

    net.ipv4.ip_unprivileged_port_start

    This defines the first unprivileged port in the network namespace. To disable all privileged ports, set this to 0. Privileged ports must not overlap with the ip_local_port_range. For more information, see ip_unprivileged_port_start.
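
    The constraints described for net.ipv4.ip_local_port_range can be checked before a value is placed in a pod specification. The following Python sketch is illustrative only: the function name is hypothetical, and the default of 1024 for ip_unprivileged_port_start is an assumption that you should replace with the value in effect for the pod.

```python
from typing import List


def check_local_port_range(value: str, unprivileged_port_start: int = 1024) -> List[str]:
    """Return a list of problems found in an ip_local_port_range value."""
    parts = value.split()
    if len(parts) != 2 or not all(part.isdigit() for part in parts):
        return ["expected two non-negative integers separated by whitespace"]
    first, last = int(parts[0]), int(parts[1])
    problems = []
    if first > last:
        problems.append("first port is greater than last port")
    if last > 65535:
        problems.append("last port is above 65535")
    # The range must not reach into the privileged ports.
    if first < unprivileged_port_start:
        problems.append("range starts below ip_unprivileged_port_start")
    # Different parity is a recommendation, not a hard requirement.
    if first % 2 == last % 2:
        problems.append("advisory: both ports have the same parity")
    return problems


if __name__ == "__main__":
    for candidate in ("32768 60999", "32770 60666", "500 60999"):
        print(candidate, "->", check_local_port_range(candidate))
```

    An empty list means that the value satisfies every rule described in the table entry. The parity message is reported as an advisory because the description only says that different parity is better if possible.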

    Table 2. Interface-specific safe sysctls
    sysctl    Description

    net.ipv4.conf.IFNAME.accept_redirects

    Accept IPv4 ICMP redirect messages.

    net.ipv4.conf.IFNAME.accept_source_route

    Accept IPv4 packets with strict source route (SRR) option.

    net.ipv4.conf.IFNAME.arp_accept

    Define behavior for gratuitous ARP frames with an IPv4 address that is not already present in the ARP table:

    • 0 - Do not create new entries in the ARP table.

    • 1 - Create new entries in the ARP table.

    net.ipv4.conf.IFNAME.arp_notify

    Define mode for notification of IPv4 address and device changes.

    net.ipv4.conf.IFNAME.disable_policy

    Disable IPSEC policy (SPD) for this IPv4 interface.

    net.ipv4.conf.IFNAME.secure_redirects

    Accept ICMP redirect messages only to gateways listed in the interface’s current gateway list.

    net.ipv4.conf.IFNAME.send_redirects

    Send redirects is enabled only if the node acts as a router. That is, a host should not send an ICMP redirect message. It is used by routers to notify the host about a better routing path that is available for a particular destination.

    net.ipv6.conf.IFNAME.accept_ra

    Accept IPv6 Router advertisements; autoconfigure using them. It also determines whether or not to transmit router solicitations. Router solicitations are transmitted only if the functional setting is to accept router advertisements.

    net.ipv6.conf.IFNAME.accept_redirects

    Accept IPv6 ICMP redirect messages.

    net.ipv6.conf.IFNAME.accept_source_route

    Accept IPv6 packets with SRR option.

    net.ipv6.conf.IFNAME.arp_accept

    Define behavior for gratuitous ARP frames with an IPv6 address that is not already present in the ARP table:

    • 0 - Do not create new entries in the ARP table.

    • 1 - Create new entries in the ARP table.

    net.ipv6.conf.IFNAME.arp_notify

    Define mode for notification of IPv6 address and device changes.

    net.ipv6.neigh.IFNAME.base_reachable_time_ms

    This parameter controls the hardware address to IP mapping lifetime in the neighbour table for IPv6.

    net.ipv6.neigh.IFNAME.retrans_time_ms

    Set the retransmit timer for neighbor discovery messages.

    Updating the interface-specific safe sysctls list

    OKD includes a predefined list of safe interface-specific sysctls. You can modify this list by updating the cni-sysctl-allowlist config map in the openshift-multus namespace.

    The support for updating the interface-specific safe sysctls list is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

    For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

    Follow this procedure to modify the predefined list of safe sysctls. This procedure describes how to extend the default allow list.

    Procedure

    1. View the existing predefined list by running the following command:

      $ oc get cm -n openshift-multus cni-sysctl-allowlist -oyaml

      Expected output

      apiVersion: v1
      data:
        allowlist.conf: |-
          ^net.ipv4.conf.IFNAME.accept_redirects$
          ^net.ipv4.conf.IFNAME.accept_source_route$
          ^net.ipv4.conf.IFNAME.arp_accept$
          ^net.ipv4.conf.IFNAME.arp_notify$
          ^net.ipv4.conf.IFNAME.disable_policy$
          ^net.ipv4.conf.IFNAME.secure_redirects$
          ^net.ipv4.conf.IFNAME.send_redirects$
          ^net.ipv6.conf.IFNAME.accept_ra$
          ^net.ipv6.conf.IFNAME.accept_redirects$
          ^net.ipv6.conf.IFNAME.accept_source_route$
          ^net.ipv6.conf.IFNAME.arp_accept$
          ^net.ipv6.conf.IFNAME.arp_notify$
          ^net.ipv6.neigh.IFNAME.base_reachable_time_ms$
          ^net.ipv6.neigh.IFNAME.retrans_time_ms$
      kind: ConfigMap
      metadata:
        annotations:
          kubernetes.io/description: |
            Sysctl allowlist for nodes.
          release.openshift.io/version: 4.13.0-0.nightly-2022-11-16-003434
        creationTimestamp: "2022-11-17T14:09:27Z"
        name: cni-sysctl-allowlist
        namespace: openshift-multus
        resourceVersion: "2422"
        uid: 96d138a3-160e-4943-90ff-6108fa7c50c3
    2. Edit the list by using the following command:

      $ oc edit cm -n openshift-multus cni-sysctl-allowlist -oyaml

      For example, to implement stricter reverse path forwarding, add ^net.ipv4.conf.IFNAME.rp_filter$ and ^net.ipv6.conf.IFNAME.rp_filter$ to the list, as shown here:

      # Please edit the object below. Lines beginning with a '#' will be ignored,
      # and an empty file will abort the edit. If an error occurs while saving this file will be
      # reopened with the relevant failures.
      #
      apiVersion: v1
      data:
        allowlist.conf: |-
          ^net.ipv4.conf.IFNAME.accept_redirects$
          ^net.ipv4.conf.IFNAME.accept_source_route$
          ^net.ipv4.conf.IFNAME.arp_accept$
          ^net.ipv4.conf.IFNAME.arp_notify$
          ^net.ipv4.conf.IFNAME.disable_policy$
          ^net.ipv4.conf.IFNAME.secure_redirects$
          ^net.ipv4.conf.IFNAME.send_redirects$
          ^net.ipv4.conf.IFNAME.rp_filter$
          ^net.ipv6.conf.IFNAME.accept_ra$
          ^net.ipv6.conf.IFNAME.accept_redirects$
          ^net.ipv6.conf.IFNAME.accept_source_route$
          ^net.ipv6.conf.IFNAME.arp_accept$
          ^net.ipv6.conf.IFNAME.arp_notify$
          ^net.ipv6.neigh.IFNAME.base_reachable_time_ms$
          ^net.ipv6.neigh.IFNAME.retrans_time_ms$
          ^net.ipv6.conf.IFNAME.rp_filter$
    3. Save the changes to the file and exit.

      You can also remove sysctls from the list. Edit the file, remove the sysctl or sysctls, then save the changes and exit.
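
    To see how an entry in allowlist.conf relates to a concrete sysctl, the following Python sketch treats each line as a regular expression and the token IFNAME as a placeholder for the interface name. This is an assumption made for illustration, based on the format of the list shown above; it is not the tuning plugin's actual code, and the function names are hypothetical.

```python
import re
from typing import List, Pattern

# A shortened allowlist in the same format as allowlist.conf.
ALLOWLIST_CONF = """\
^net.ipv4.conf.IFNAME.accept_redirects$
^net.ipv4.conf.IFNAME.rp_filter$
^net.ipv6.neigh.IFNAME.retrans_time_ms$
"""


def load_patterns(text: str) -> List[Pattern[str]]:
    """Compile every non-empty line of the allowlist as a regular expression."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]


def is_allowed(sysctl: str, ifname: str, patterns: List[Pattern[str]]) -> bool:
    """Check a concrete sysctl such as net.ipv4.conf.net1.rp_filter.

    Assumption: the interface name is replaced with the literal token
    IFNAME before the name is matched against the allowlist.
    """
    templated = sysctl.replace("." + ifname + ".", ".IFNAME.")
    return any(pattern.search(templated) for pattern in patterns)


if __name__ == "__main__":
    patterns = load_patterns(ALLOWLIST_CONF)
    print(is_allowed("net.ipv4.conf.net1.rp_filter", "net1", patterns))
    print(is_allowed("net.ipv4.conf.net1.forwarding", "net1", patterns))
```

    Note that the dots in the entries are unescaped, so as regular expressions they match any character, not only a literal dot.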

    Verification

    Follow this procedure to enforce stricter reverse path forwarding for IPv4. For more information on reverse path forwarding see Reverse Path Forwarding.

    1. Create a network attachment definition, such as reverse-path-fwd-example.yaml, with the following content:

      apiVersion: "k8s.cni.cncf.io/v1"
      kind: NetworkAttachmentDefinition
      metadata:
        name: tuningnad
        namespace: default
      spec:
        config: '{
            "cniVersion": "0.4.0",
            "name": "tuningnad",
            "plugins": [
              {
                "type": "bridge"
              },
              {
                "type": "tuning",
                "sysctl": {
                  "net.ipv4.conf.IFNAME.rp_filter": "1"
                }
              }
            ]
          }'
    2. Apply the YAML by running the following command:

      $ oc apply -f reverse-path-fwd-example.yaml

      Example output

      networkattachmentdefinition.k8s.cni.cncf.io/tuningnad created
    3. Create a pod definition, such as examplepod.yaml, with the following YAML:

      apiVersion: v1
      kind: Pod
      metadata:
        name: example
        labels:
          app: httpd
        namespace: default
        annotations:
          k8s.v1.cni.cncf.io/networks: tuningnad # (1)
      spec:
        securityContext:
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
        containers:
          - name: httpd
            image: 'image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest'
            ports:
              - containerPort: 8080
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                  - ALL

      (1) Specify the name of the configured NetworkAttachmentDefinition.
    4. Apply the YAML by running the following command:

      $ oc apply -f examplepod.yaml
    5. Verify that the pod is created by running the following command:

      $ oc get pod

      Example output

      NAME      READY   STATUS    RESTARTS   AGE
      example   1/1     Running   0          47s
    6. Log in to the pod by running the following command:

      $ oc rsh example
    7. Verify the value of the configured sysctl flag. For example, find the value of net.ipv4.conf.net1.rp_filter by running the following command:

      sh-4.4# sysctl net.ipv4.conf.net1.rp_filter

      Expected output

      net.ipv4.conf.net1.rp_filter = 1
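
    Because the config field of a network attachment definition is a JSON string embedded in YAML, a missing brace is easy to overlook. One way to avoid that is to generate the string from a data structure, as in the following Python sketch. The function name is hypothetical; the field values mirror the tuningnad example in this procedure.

```python
import json
from typing import Dict


def tuning_nad_config(name: str, sysctls: Dict[str, str]) -> str:
    """Build the JSON string for the spec.config field of the example."""
    config = {
        "cniVersion": "0.4.0",
        "name": name,
        "plugins": [
            {"type": "bridge"},
            # The tuning plugin entry carries the interface-specific sysctls.
            {"type": "tuning", "sysctl": dict(sysctls)},
        ],
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    text = tuning_nad_config("tuningnad", {"net.ipv4.conf.IFNAME.rp_filter": "1"})
    # Round-trip through the parser to confirm the string is valid JSON.
    json.loads(text)
    print(text)
```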

    Starting a pod with safe sysctls

    You can set sysctls on pods using the pod’s securityContext. The securityContext applies to all containers in the same pod.

    Safe sysctls are allowed by default.

    This example uses the pod securityContext to set the following safe sysctls:

    • kernel.shm_rmid_forced

    • net.ipv4.ip_local_port_range

    • net.ipv4.tcp_syncookies

    • net.ipv4.ping_group_range

    To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.

    Use this procedure to start a pod with the configured sysctl settings.

    Procedure

    1. Create a YAML file sysctl_pod.yaml that defines an example pod and add the securityContext spec, as shown in the following example:

      apiVersion: v1
      kind: Pod
      metadata:
        name: sysctl-example
        namespace: default
      spec:
        containers:
        - name: podexample
          image: centos
          command: ["/bin/bash", "-c", "sleep INF"]
          securityContext:
            runAsUser: 2000 # (1)
            runAsGroup: 3000 # (2)
            allowPrivilegeEscalation: false # (3)
            capabilities: # (4)
              drop: ["ALL"]
        securityContext:
          runAsNonRoot: true # (5)
          seccompProfile: # (6)
            type: RuntimeDefault
          sysctls:
          - name: kernel.shm_rmid_forced
            value: "1"
          - name: net.ipv4.ip_local_port_range
            value: "32770 60666"
          - name: net.ipv4.tcp_syncookies
            value: "0"
          - name: net.ipv4.ping_group_range
            value: "0 200000000"

      (1) runAsUser controls which user ID the container is run with.
      (2) runAsGroup controls which primary group ID the container is run with.
      (3) allowPrivilegeEscalation determines if a pod can request to allow privilege escalation. If unspecified, it defaults to true. This boolean directly controls whether the no_new_privs flag gets set on the container process.
      (4) capabilities permit privileged actions without giving full root access. This policy ensures all capabilities are dropped from the pod.
      (5) runAsNonRoot: true requires that the container runs as a user with any UID other than 0.
      (6) RuntimeDefault enables the default seccomp profile for a pod or container workload.
    2. Create the pod by running the following command:

      $ oc apply -f sysctl_pod.yaml
    3. Verify that the pod is created by running the following command:

      $ oc get pod

      Example output

      NAME             READY   STATUS    RESTARTS   AGE
      sysctl-example   1/1     Running   0          14s
    4. Log in to the pod by running the following command:

      $ oc rsh sysctl-example
    5. Verify the values of the configured sysctl flags. For example, find the value of kernel.shm_rmid_forced by running the following command:

      sh-4.4# sysctl kernel.shm_rmid_forced

      Expected output

      kernel.shm_rmid_forced = 1
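
    When several sysctls are configured, checking them one at a time becomes tedious. The following Python sketch parses output in the "name = value" form shown above and reports the values that differ from what the pod specification requested. Both function names are hypothetical, and the sketch assumes that whitespace inside multi-value settings is not significant.

```python
from typing import Dict, Optional, Tuple


def parse_sysctl_output(text: str) -> Dict[str, str]:
    """Parse lines of the form 'name = value' into a dictionary."""
    values = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        if separator:
            # Collapse tabs and repeated spaces inside multi-value settings.
            values[name.strip()] = " ".join(value.split())
    return values


def mismatches(
    expected: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (wanted, found)} for every value that differs."""
    result = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got != " ".join(want.split()):
            result[name] = (want, got)
    return result


if __name__ == "__main__":
    output = "kernel.shm_rmid_forced = 1\nnet.ipv4.ip_local_port_range = 32770\t60666\n"
    wanted = {
        "kernel.shm_rmid_forced": "1",
        "net.ipv4.ip_local_port_range": "32770 60666",
    }
    print(mismatches(wanted, parse_sysctl_output(output)))
```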

    Starting a pod with unsafe sysctls

    A pod with unsafe sysctls fails to launch on any node unless the cluster administrator explicitly enables unsafe sysctls for that node. As with node-level sysctls, use the taints and tolerations feature or labels on nodes to schedule those pods onto the right nodes.

    The following example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two unsafe sysctls, net.core.somaxconn and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.

    To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.

    The following example illustrates what happens when you add safe and unsafe sysctls to a pod specification:

    Procedure

    1. Create a YAML file sysctl-example-unsafe.yaml that defines an example pod and add the securityContext specification, as shown in the following example:

      apiVersion: v1
      kind: Pod
      metadata:
        name: sysctl-example-unsafe
      spec:
        containers:
        - name: podexample
          image: centos
          command: ["/bin/bash", "-c", "sleep INF"]
          securityContext:
            runAsUser: 2000
            runAsGroup: 3000
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
        securityContext:
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
          sysctls:
          - name: kernel.shm_rmid_forced
            value: "0"
          - name: net.core.somaxconn
            value: "1024"
          - name: kernel.msgmax
            value: "65536"
    2. Create the pod using the following command:

      $ oc apply -f sysctl-example-unsafe.yaml
    3. Verify that the pod is scheduled but does not deploy, because unsafe sysctls are not allowed for the node, by running the following command:

      $ oc get pod

      Example output

      NAME                    READY   STATUS            RESTARTS   AGE
      sysctl-example-unsafe   0/1     SysctlForbidden   0          14s
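
    Because the pod specification does not distinguish safe from unsafe sysctls, it can help to check a specification before you apply it. The following Python sketch lists the requested sysctls that are not in the system-wide safe list from Table 1. The function name is hypothetical, and the pod is represented as a plain dictionary rather than loaded from a file.

```python
from typing import List

# System-wide safe sysctls from Table 1.
SAFE_SYSCTLS = frozenset(
    {
        "kernel.shm_rmid_forced",
        "net.ipv4.ip_local_port_range",
        "net.ipv4.tcp_syncookies",
        "net.ipv4.ping_group_range",
        "net.ipv4.ip_unprivileged_port_start",
    }
)

# The relevant part of the sysctl-example-unsafe pod specification.
EXAMPLE_POD = {
    "spec": {
        "securityContext": {
            "sysctls": [
                {"name": "kernel.shm_rmid_forced", "value": "0"},
                {"name": "net.core.somaxconn", "value": "1024"},
                {"name": "kernel.msgmax", "value": "65536"},
            ]
        }
    }
}


def unsafe_sysctls(pod: dict) -> List[str]:
    """Return the requested sysctl names that are not in the safe list."""
    entries = pod.get("spec", {}).get("securityContext", {}).get("sysctls", [])
    return [entry["name"] for entry in entries if entry["name"] not in SAFE_SYSCTLS]


if __name__ == "__main__":
    print(unsafe_sysctls(EXAMPLE_POD))
```

    For the example pod, the result contains net.core.somaxconn and kernel.msgmax, which are the two sysctls that cause the SysctlForbidden status.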

    A cluster administrator can allow certain unsafe sysctls for very special situations such as high performance or real-time application tuning.

    If you want to use unsafe sysctls, a cluster administrator must enable them individually for a specific type of node. The sysctls must be namespaced.

    You can further control which sysctls are set in pods by specifying lists of sysctls or sysctl patterns in the allowedUnsafeSysctls field of the Security Context Constraints.

    • The allowedUnsafeSysctls option controls specific needs such as high performance or real-time application tuning.

    Because these sysctls are unsafe, you use them at your own risk. They can lead to severe problems, such as improper behavior of containers, resource shortage, or breaking a node.
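
    The allowedUnsafeSysctls field accepts both full names and patterns such as kernel.msg*. The following Python sketch shows one plausible reading of such a list, assuming that a trailing asterisk acts as a prefix wildcard and that every other entry must match exactly. That matching rule is an assumption for illustration, and the function name is hypothetical.

```python
from typing import Iterable


def matches_allowed(name: str, patterns: Iterable[str]) -> bool:
    """Check a sysctl name against a list of names and prefix patterns."""
    for pattern in patterns:
        if pattern.endswith("*"):
            # Assumption: "kernel.msg*" allows every name with that prefix.
            if name.startswith(pattern[:-1]):
                return True
        elif name == pattern:
            return True
    return False


if __name__ == "__main__":
    allowed = ["kernel.msg*", "net.core.somaxconn"]
    for candidate in ("kernel.msgmax", "net.core.somaxconn", "net.core.rmem_max"):
        print(candidate, "->", matches_allowed(candidate, allowed))
```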

    Procedure

    1. List the existing machine config pools for your OKD cluster to decide which pool to label by running the following command:

      $ oc get machineconfigpool

      Example output

      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-bfb92f0cd1684e54d8e234ab7423cc96   True      False      False      3              3                   3                     0                      42m
      worker   rendered-worker-21b6cb9a0f8919c88caf39db80ac1fce   True      False      False      3              3                   3                     0                      42m
    2. Add a label to the machine config pool where the containers with the unsafe sysctls will run by running the following command:

      $ oc label machineconfigpool worker custom-kubelet=sysctl
    3. Create a YAML file set-sysctl-worker.yaml that defines a KubeletConfig custom resource (CR):

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: custom-kubelet
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: sysctl # (1)
        kubeletConfig:
          allowedUnsafeSysctls: # (2)
            - "kernel.msg*"
            - "net.core.somaxconn"

      (1) Specify the label from the machine config pool.
      (2) List the unsafe sysctls you want to allow.
    4. Create the object by running the following command:

      $ oc apply -f set-sysctl-worker.yaml
    5. Wait for the Machine Config Operator to generate the new rendered configuration and apply it to the machines by running the following command:

      $ oc get machineconfigpool worker -w

      After some minutes, the UPDATING status changes from True to False:

      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      worker   rendered-worker-f1704a00fc6f30d3a7de9a15fd68a800   False     True       False      3              2                   2                     0                      71m
      worker   rendered-worker-f1704a00fc6f30d3a7de9a15fd68a800   False     True       False      3              2                   3                     0                      72m
      worker   rendered-worker-0188658afe1f3a183ec8c4f14186f4d5   True      False      False      3              3                   3                     0                      72m
    6. Create a YAML file sysctl-example-safe-unsafe.yaml that defines an example pod and add the securityContext spec, as shown in the following example:

      apiVersion: v1
      kind: Pod
      metadata:
        name: sysctl-example-safe-unsafe
      spec:
        containers:
        - name: podexample
          image: centos
          command: ["/bin/bash", "-c", "sleep INF"]
          securityContext:
            runAsUser: 2000
            runAsGroup: 3000
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
        securityContext:
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
          sysctls:
          - name: kernel.shm_rmid_forced
            value: "0"
          - name: net.core.somaxconn
            value: "1024"
          - name: kernel.msgmax
            value: "65536"
    7. Create the pod by running the following command:

      $ oc apply -f sysctl-example-safe-unsafe.yaml

      Expected output

      Warning: would violate PodSecurity "restricted:latest": forbidden sysctls (net.core.somaxconn, kernel.msgmax)
      pod/sysctl-example-safe-unsafe created
    8. Verify that the pod is created by running the following command:

      $ oc get pod

      Example output

      NAME                         READY   STATUS    RESTARTS   AGE
      sysctl-example-safe-unsafe   1/1     Running   0          19s
    9. Log in to the pod by running the following command:

      $ oc rsh sysctl-example-safe-unsafe
    10. Verify the values of the configured sysctl flags. For example, find the value of net.core.somaxconn by running the following command:

      sh-4.4# sysctl net.core.somaxconn

      Expected output

      net.core.somaxconn = 1024
