Restrict a Container’s Syscalls with Seccomp

    Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a Node to your Pods and containers.

    Identifying the privileges required for your workloads can be difficult. In this tutorial, you will go through how to load seccomp profiles into a local Kubernetes cluster, how to apply them to a Pod, and how you can begin to craft profiles that give only the necessary privileges to your container processes.

    Objectives

    • Learn how to load seccomp profiles on a node
    • Learn how to apply a seccomp profile to a container
    • Observe auditing of syscalls made by a container process
    • Observe behavior when a missing profile is specified
    • Observe a violation of a seccomp profile
    • Learn how to create fine-grained seccomp profiles
    • Learn how to apply a container runtime default seccomp profile

    Before you begin

    To complete all steps in this tutorial, you must install kind and kubectl. This tutorial shows examples with both alpha (pre-v1.19) and generally available seccomp functionality, so make sure that your cluster is configured correctly for the version you are using.

    Create Seccomp Profiles

    The contents of these profiles will be explored later on, but for now go ahead and download them into a directory named profiles/ so that they can be loaded into the cluster.

    pods/security/seccomp/profiles/audit.json

    {
        "defaultAction": "SCMP_ACT_LOG"
    }

    pods/security/seccomp/profiles/violation.json

    {
        "defaultAction": "SCMP_ACT_ERRNO"
    }

    pods/security/seccomp/profiles/fine-grained.json

    {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [
            "SCMP_ARCH_X86_64",
            "SCMP_ARCH_X86",
            "SCMP_ARCH_X32"
        ],
        "syscalls": [
            {
                "names": [
                    "accept4",
                    "epoll_wait",
                    "pselect6",
                    "futex",
                    "madvise",
                    "epoll_ctl",
                    "getsockname",
                    "setsockopt",
                    "vfork",
                    "mmap",
                    "read",
                    "write",
                    "close",
                    "arch_prctl",
                    "sched_getaffinity",
                    "munmap",
                    "brk",
                    "rt_sigaction",
                    "rt_sigprocmask",
                    "sigaltstack",
                    "gettid",
                    "clone",
                    "bind",
                    "socket",
                    "openat",
                    "readlinkat",
                    "exit_group",
                    "epoll_create1",
                    "listen",
                    "rt_sigreturn",
                    "sched_yield",
                    "clock_gettime",
                    "connect",
                    "dup2",
                    "epoll_pwait",
                    "execve",
                    "exit",
                    "fcntl",
                    "getpid",
                    "getuid",
                    "ioctl",
                    "mprotect",
                    "nanosleep",
                    "open",
                    "poll",
                    "recvfrom",
                    "sendto",
                    "set_tid_address",
                    "setitimer",
                    "writev"
                ],
                "action": "SCMP_ACT_ALLOW"
            }
        ]
    }
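    Before mounting the profiles/ directory into a node, it can help to confirm that each file parses and uses recognizable actions. The following is a minimal sketch rather than part of the tutorial: check_profile is a hypothetical helper, and KNOWN_ACTIONS is an assumption based on the action names listed in the OCI runtime specification, so adjust it if your runtime differs.

```python
import json

# Assumption: action names as listed in the OCI runtime spec for seccomp.
KNOWN_ACTIONS = {
    "SCMP_ACT_KILL", "SCMP_ACT_KILL_PROCESS", "SCMP_ACT_KILL_THREAD",
    "SCMP_ACT_TRAP", "SCMP_ACT_ERRNO", "SCMP_ACT_TRACE",
    "SCMP_ACT_ALLOW", "SCMP_ACT_LOG", "SCMP_ACT_NOTIFY",
}


def check_profile(text):
    """Return a list of problems found in a profile's JSON text (empty if none)."""
    try:
        profile = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(profile, dict):
        return ["top level is not a JSON object"]

    problems = []
    if profile.get("defaultAction") not in KNOWN_ACTIONS:
        problems.append("missing or unknown defaultAction")
    for i, rule in enumerate(profile.get("syscalls", [])):
        if rule.get("action") not in KNOWN_ACTIONS:
            problems.append(f"syscalls[{i}]: missing or unknown action")
        if not rule.get("names"):
            problems.append(f"syscalls[{i}]: no syscall names")
    return problems


# The deny-everything profile shown above passes the check.
print(check_profile('{"defaultAction": "SCMP_ACT_ERRNO"}'))
```

    To check real files, read each one from profiles/ and pass its contents to check_profile. The check is deliberately shallow: it does not validate architectures or argument filters.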

    For simplicity, kind can be used to create a single node cluster with the seccomp profiles loaded. Kind runs Kubernetes in Docker, so each node of the cluster is actually just a container. This allows for files to be mounted in the filesystem of each container just as one might load files onto a node.

    apiVersion: kind.x-k8s.io/v1alpha4
    kind: Cluster
    nodes:
    - role: control-plane
      extraMounts:
      - hostPath: "./profiles"
        containerPath: "/var/lib/kubelet/seccomp/profiles"

    Download the example above, and save it to a file named kind.yaml. Then create the cluster with the configuration.

    kind create cluster --config=kind.yaml

    Once the cluster is ready, identify the container running as the single node cluster:

    docker ps

    You should see output indicating that a container is running with name kind-control-plane.

    CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS                       NAMES
    6a96207fed4b   kindest/node:v1.18.2   "/usr/local/bin/entr…"   27 seconds ago   Up 24 seconds   127.0.0.1:42223->6443/tcp   kind-control-plane

    If you inspect the filesystem of that container, you should see that the profiles/ directory has been loaded into the default seccomp path of the kubelet. Use docker exec to run a command in the node container:

    docker exec -it 6a96207fed4b ls /var/lib/kubelet/seccomp/profiles

    audit.json  fine-grained.json  violation.json

    Create a Pod with a Seccomp profile for syscall auditing

    To start off, apply the audit.json profile, which will log all syscalls of the process, to a new Pod.

    Download the correct manifest for your Kubernetes version:

    For v1.19 or later (seccompProfile field):

    apiVersion: v1
    kind: Pod
    metadata:
      name: audit-pod
      labels:
        app: audit-pod
    spec:
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/audit.json
      containers:
      - name: test-container
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

    For versions before v1.19 (alpha annotation):

    apiVersion: v1
    kind: Pod
    metadata:
      name: audit-pod
      labels:
        app: audit-pod
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: localhost/profiles/audit.json
    spec:
      containers:
      - name: test-container
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false
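    The localhostProfile value profiles/audit.json is a path relative to the kubelet's seccomp directory, which in this tutorial is /var/lib/kubelet/seccomp (the parent of the containerPath mounted by kind). The sketch below illustrates that lookup. It is not the kubelet's implementation: resolve_localhost_profile is a hypothetical helper, and the two rejection rules are simplifying assumptions about what counts as a valid relative path.

```python
import posixpath

# Default kubelet seccomp directory used throughout this tutorial.
KUBELET_SECCOMP_ROOT = "/var/lib/kubelet/seccomp"


def resolve_localhost_profile(localhost_profile, root=KUBELET_SECCOMP_ROOT):
    """Return the absolute path on the node for a localhostProfile value."""
    if posixpath.isabs(localhost_profile):
        raise ValueError("localhostProfile must be a relative path")
    path = posixpath.normpath(posixpath.join(root, localhost_profile))
    # Reject values such as "../x" that would escape the seccomp root.
    if not path.startswith(root + "/"):
        raise ValueError("localhostProfile escapes the seccomp root")
    return path


print(resolve_localhost_profile("profiles/audit.json"))
```

    If a Pod references a profile that does not exist at the resolved path on the node, the container cannot start, so the result is worth comparing with the directory listing from the node container.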

    Create the Pod in the cluster:

    kubectl apply -f audit-pod.yaml

    This profile does not restrict any syscalls, so the Pod should start successfully.

    kubectl get pod/audit-pod

    NAME        READY   STATUS    RESTARTS   AGE
    audit-pod   1/1     Running   0          30s

    To interact with the endpoint exposed by this container, create a NodePort Service that allows access to the endpoint from inside the kind control plane container.

    kubectl expose pod/audit-pod --type NodePort --port 5678

    Check what port the Service has been assigned on the node.

    kubectl get svc/audit-pod

    NAME        TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
    audit-pod   NodePort   10.111.36.142   <none>        5678:32373/TCP   72s

    Now you can curl the endpoint from inside the kind control plane container at the port exposed by this Service. Use docker exec to run a command in the node container:

    docker exec -it 6a96207fed4b curl localhost:32373

    just made some syscalls!

    You can see that the process is running, but what syscalls did it actually make? Because this Pod is running in a local cluster, you should be able to see those in /var/log/syslog. Open up a new terminal window and tail the output for calls from http-echo:

    tail -f /var/log/syslog | grep 'http-echo'

    You should already see some logs of syscalls made by http-echo, and if you curl the endpoint in the control plane container you will see more written.

    Jul 6 15:37:40 my-machine kernel: [369128.669452] audit: type=1326 audit(1594067860.484:14536): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0 ip=0x46fe1f code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669453] audit: type=1326 audit(1594067860.484:14537): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=54 compat=0 ip=0x46fdba code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669455] audit: type=1326 audit(1594067860.484:14538): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669456] audit: type=1326 audit(1594067860.484:14539): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=288 compat=0 ip=0x46fdba code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669517] audit: type=1326 audit(1594067860.484:14540): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=0 compat=0 ip=0x46fd44 code=0x7ffc0000
    Jul 6 15:37:40 my-machine kernel: [369128.669519] audit: type=1326 audit(1594067860.484:14541): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
    Jul 6 15:38:40 my-machine kernel: [369188.671648] audit: type=1326 audit(1594067920.488:14559): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
    Jul 6 15:38:40 my-machine kernel: [369188.671726] audit: type=1326 audit(1594067920.488:14560): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000

    You can begin to understand the syscalls required by the http-echo process by looking at the syscall= entry on each line. While these are unlikely to encompass all the syscalls it uses, they can serve as a basis for a seccomp profile for this container.
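    Reading syscall= values by eye gets tedious, so the audit lines can be reduced to a de-duplicated list. The sketch below is one way to do that under stated assumptions: the log lines follow the syslog format shown above, syscalls_for and names_for are hypothetical helpers, and X86_64_SYSCALLS is only a small excerpt of the x86_64 syscall table (matching arch=c000003e) covering the numbers in the sample output.

```python
import re

# Excerpt of the x86_64 syscall table; extend it for other numbers you observe.
X86_64_SYSCALLS = {
    0: "read", 51: "getsockname", 54: "setsockopt",
    202: "futex", 270: "pselect6", 288: "accept4",
}

# type=1326 marks a seccomp audit record; capture the process name and syscall number.
AUDIT_RE = re.compile(r'type=1326 .*comm="(?P<comm>[^"]+)".*\bsyscall=(?P<nr>\d+)')

SAMPLE = [
    'kernel: audit: type=1326 audit(1594067860.484:14536): pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0',
    'kernel: audit: type=1326 audit(1594067860.484:14538): pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0',
    'kernel: audit: type=1326 audit(1594067920.488:14560): pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0',
    'kernel: audit: type=1326 audit(1594067920.488:14561): pid=1 comm="other" exe="/other" sig=0 arch=c000003e syscall=0 compat=0',
]


def syscalls_for(lines, comm):
    """Return the sorted, de-duplicated syscall numbers logged for process `comm`."""
    numbers = set()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match and match.group("comm") == comm:
            numbers.add(int(match.group("nr")))
    return sorted(numbers)


def names_for(numbers, table=X86_64_SYSCALLS):
    """Map syscall numbers to names, marking numbers missing from the table."""
    return [table.get(n, f"unknown({n})") for n in numbers]


print(names_for(syscalls_for(SAMPLE, "http-echo")))
```

    Syscall numbers are architecture-specific, so a table for a different architecture is needed if the arch= field in your logs differs.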

    Clean up that Pod and Service before moving to the next section:

    kubectl delete pod/audit-pod
    kubectl delete svc/audit-pod

    Create Pod with Seccomp Profile that Causes Violation

    For demonstration, apply a profile to the Pod that does not allow for any syscalls.

    Download the correct manifest for your Kubernetes version:

    For v1.19 or later (seccompProfile field):

    apiVersion: v1
    kind: Pod
    metadata:
      name: violation-pod
      labels:
        app: violation-pod
    spec:
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/violation.json
      containers:
      - name: test-container
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

    For versions before v1.19 (pods/security/seccomp/alpha/violation-pod.yaml):

    apiVersion: v1
    kind: Pod
    metadata:
      name: violation-pod
      labels:
        app: violation-pod
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: localhost/profiles/violation.json
    spec:
      containers:
      - name: test-container
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

    Create the Pod in the cluster:

    kubectl apply -f violation-pod.yaml

    If you check the status of the Pod, you should see that it failed to start.

    kubectl get pod/violation-pod

    NAME            READY   STATUS             RESTARTS   AGE
    violation-pod   0/1     CrashLoopBackOff   1          6s

    As seen in the previous example, the http-echo process requires quite a few syscalls. Here seccomp has been instructed to error on any syscall by setting "defaultAction": "SCMP_ACT_ERRNO". This is extremely secure, but removes the ability to do anything meaningful. What you really want is to give workloads only the privileges they need.
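    To see why a profile containing only "defaultAction": "SCMP_ACT_ERRNO" blocks everything, it helps to model how a profile assigns an action to a syscall. The sketch below is a simplified illustration, not the kernel's or the runtime's logic: action_for is a hypothetical helper that matches by syscall name only and ignores architectures and argument filters.

```python
def action_for(profile, syscall):
    """Return the action `profile` assigns to the named syscall.

    The first rule that lists the syscall wins; otherwise the
    profile's defaultAction applies.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


# The violation profile has no rules, so every syscall gets the default.
violation = {"defaultAction": "SCMP_ACT_ERRNO"}

# A profile that allows a few syscalls and errors on everything else.
allow_some = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
}

print(action_for(violation, "read"))
print(action_for(allow_some, "read"))
print(action_for(allow_some, "ptrace"))
```

    With the violation profile, even the first syscalls a process needs at startup fall through to the default action, which is consistent with the container never reaching a running state.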

    Clean up that Pod before moving to the next section (no Service was created for it):

    kubectl delete pod/violation-pod

    Create Pod with Seccomp Profile that Only Allows Necessary Syscalls

    If you take a look at fine-grained.json, you will notice some of the syscalls seen in the first example, where the profile set "defaultAction": "SCMP_ACT_LOG". Now the profile sets "defaultAction": "SCMP_ACT_ERRNO" but explicitly allows a set of syscalls in the "action": "SCMP_ACT_ALLOW" block. Ideally, the container will run successfully and you will see no messages sent to syslog.
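    A profile of this shape can be generated from a list of syscall names gathered during auditing. The following is a minimal sketch under assumptions: build_allowlist_profile is a hypothetical helper, and the default architectures simply mirror those in the profile shown earlier in this tutorial.

```python
import json


def build_allowlist_profile(
    names,
    architectures=("SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"),
):
    """Build a profile that errors by default and allows only `names`."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": list(architectures),
        "syscalls": [
            {
                # De-duplicate and sort so the generated file is stable.
                "names": sorted(set(names)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


profile = build_allowlist_profile(["read", "write", "futex", "read"])
print(json.dumps(profile, indent=4))
```

    Syscalls observed in one audit run are unlikely to be exhaustive, so a generated profile is best treated as a starting point and tested against the workload before use.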

    Download the correct manifest for your Kubernetes version:

    For versions before v1.19 (pods/security/seccomp/alpha/fine-pod.yaml):

    apiVersion: v1
    kind: Pod
    metadata:
      name: fine-pod
      labels:
        app: fine-pod
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: localhost/profiles/fine-grained.json
    spec:
      containers:
      - name: test-container
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

    For v1.19 or later, replace the annotation with a Pod-level securityContext.seccompProfile of type Localhost with localhostProfile: profiles/fine-grained.json, following the audit-pod manifest shown earlier.

    Create the Pod in your cluster:

    kubectl apply -f fine-pod.yaml

    The Pod should start successfully.

    kubectl get pod/fine-pod

    NAME       READY   STATUS    RESTARTS   AGE
    fine-pod   1/1     Running   0          30s

    Open up a new terminal window and tail the output for calls from http-echo:

    tail -f /var/log/syslog | grep 'http-echo'

    Expose the Pod with a NodePort Service:

    kubectl expose pod/fine-pod --type NodePort --port 5678

    Check what port the Service has been assigned on the node:

    kubectl get svc/fine-pod

    NAME       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
    fine-pod   NodePort   10.111.36.142   <none>        5678:32373/TCP   72s

    curl the endpoint from inside the kind control plane container:

    docker exec -it 6a96207fed4b curl localhost:32373

    just made some syscalls!

    You should see no output in the syslog because the profile allowed all necessary syscalls and specified that an error should occur if one outside of the list is invoked. This is an ideal situation from a security perspective, but it required some effort in analyzing the program. It would be nice if there were a simple way to get closer to this level of security without requiring as much effort.

    Clean up that Pod and Service before moving to the next section:

    kubectl delete pod/fine-pod
    kubectl delete svc/fine-pod

    Create Pod that uses the Container Runtime Default Seccomp Profile

    Most container runtimes provide a sensible default set of allowed and blocked syscalls. You can apply these defaults in Kubernetes by using the runtime/default annotation or by setting the seccomp type in the security context of a Pod or container to RuntimeDefault.

    Download the correct manifest for your Kubernetes version:

    For v1.19 or later (seccompProfile field):

    apiVersion: v1
    kind: Pod
    metadata:
      name: default-pod
      labels:
        app: default-pod
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: test-container
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false

    For versions before v1.19 (pods/security/seccomp/alpha/default-pod.yaml):

    apiVersion: v1
    kind: Pod
    metadata:
      name: default-pod
      labels:
        app: default-pod
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: runtime/default
    spec:
      containers:
      - name: test-container
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=just made some syscalls!"
        securityContext:
          allowPrivilegeEscalation: false
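    The alpha annotation values and the seccompProfile field used throughout this tutorial carry the same information in two forms. The sketch below shows the correspondence for migrating a manifest by hand. It is an illustration only: seccomp_profile_from_annotation is a hypothetical helper, and the unconfined case is an assumption that does not appear in this tutorial.

```python
def seccomp_profile_from_annotation(value):
    """Translate an alpha seccomp annotation value into a seccompProfile mapping."""
    if value == "runtime/default":
        return {"type": "RuntimeDefault"}
    if value == "unconfined":  # assumption: not covered in this tutorial
        return {"type": "Unconfined"}
    prefix = "localhost/"
    if value.startswith(prefix) and len(value) > len(prefix):
        # The remainder is the path relative to the kubelet seccomp directory.
        return {"type": "Localhost", "localhostProfile": value[len(prefix):]}
    raise ValueError(f"unrecognized seccomp annotation value: {value!r}")


print(seccomp_profile_from_annotation("runtime/default"))
print(seccomp_profile_from_annotation("localhost/profiles/audit.json"))
```

    The returned mapping corresponds to what goes under securityContext.seccompProfile in the Pod spec.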

    What’s next

    Additional resources: