Exam Tip: The exam will probably not ask us to create a seccomp profile by hand, but it probably will ask us to copy an existing seccomp profile to the correct directory and use it in pods

All seccomp-related documentation can be found in the Kubernetes docs by searching for seccomp, which leads here: https://kubernetes.io/docs/tutorials/security/seccomp/

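System calls are the interface between user space, where applications run, and kernel space: an application makes a system call to request a service from the kernel

Examples of system calls are open(), close(), execve(), read(), write(), mmap(), etc. Note that strlen() is a plain C library function, and readdir() and closedir() are library wrappers built on syscalls such as getdents64() and close(), so they do not appear under those names in a trace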

strace

Determine what syscalls an application uses with strace

# strace is a tool used to trace the system calls made by an application
# It is typically installed at /usr/bin/strace

# Simply add strace to the beginning of a command
# This provides a lot of detail
strace touch /tmp/error.log
# execve("/usr/bin/touch", ["touch", "/tmp/error.log"], 0x7ffdb5aef278 /* 40 vars */) = 0
# ...

# to trace a running process we need the pid
pidof etcd
# 3596

# Now use that pid to attach to the process with strace
strace -p 3596

# -c or --summary-only will provide a summary
strace -c touch /tmp/error.log

Results

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         3           read
  0.00    0.000000           0        22           close
  0.00    0.000000           0        18           fstat
  0.00    0.000000           0        22           mmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         6           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0        19           openat
  0.00    0.000000           0         1           utimensat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   103         2 total
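
The syscall column of this summary is the raw material for a seccomp profile that allows only what the application uses. As a minimal sketch, assuming the summary has been captured as text (strace writes it to stderr, or to a file with -o), the names can be pulled out with awk. The function name syscalls_from_summary is hypothetical and the sample rows are abridged from the results above.

```shell
# Hypothetical helper: print the syscall column of `strace -c` summary
# output read from stdin.
syscalls_from_summary() {
  # Data rows start with a percentage; the syscall name is the last field.
  # The final "total" row is skipped.
  awk '$1 ~ /^[0-9]+\.[0-9]+$/ && $NF != "total" { print $NF }'
}

# Sample rows abridged from the summary above.
summary=$(cat <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         3           read
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           utimensat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   103         2 total
EOF
)

names=$(printf '%s\n' "$summary" | syscalls_from_summary | LC_ALL=C sort)
printf '%s\n' "$names"
```

Against a real run, something like strace -c -o /tmp/summary.txt touch /tmp/error.log followed by syscalls_from_summary < /tmp/summary.txt should give the same kind of list.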

Aquasec Tracee

Used to trace system calls on containers

Tracee can be installed in the OS but it is often easier to run it as a Docker container
The Tracee container needs three bind mounts (/lib/modules, /usr/src and /tmp/tracee), privileged mode (--privileged) and the host PID namespace (--pid=host). Memorize this

# To trace only processes whose command name (comm) is ls
docker run --name tracee --rm --privileged --pid=host \
  -v /lib/modules/:/lib/modules/:ro \
  -v /usr/src:/usr/src:ro \
  -v /tmp/tracee:/tmp/tracee \
  aquasec/tracee:0.4.0 --trace comm=ls

# To trace all new pids on the host
docker run --name tracee --rm --privileged --pid=host \
  -v /lib/modules/:/lib/modules/:ro \
  -v /usr/src:/usr/src:ro \
  -v /tmp/tracee:/tmp/tracee \
  aquasec/tracee:0.4.0 --trace pid=new

# Trace all new containers
docker run --name tracee --rm --privileged --pid=host \
  -v /lib/modules/:/lib/modules/:ro \
  -v /usr/src:/usr/src:ro \
  -v /tmp/tracee:/tmp/tracee \
  aquasec/tracee:0.4.0 --trace container=new

Restrict syscalls with seccomp

There are about 435 syscalls in Linux and by default an application can use any of them. In reality no application needs that many, and every syscall left available increases the attack surface

In 2016 the Dirty COW vulnerability (CVE-2016-5195) was exploited using the ptrace syscall to write to a read-only file, gain root access and break out of the container

seccomp can be used to restrict syscalls

Check whether the kernel was built with seccomp support

grep -i seccomp /boot/config-$(uname -r)

Output

CONFIG_SECCOMP=y

Run the whalesay container

docker run docker/whalesay cowsay hello!

# Now run the same image but start an interactive shell in it
docker run -it --rm docker/whalesay /bin/sh

# Try to set date
date -s '19 APR 2012 22:00:00'
# fails

# Check pid of shell
ps -ef

Results

UID        PID  PPID  C STIME TTY      TIME CMD
root         1     0  0 18:38 pts/0    00:00:00 /bin/sh
root         8     1  0 18:38 pts/0    00:00:00 ps -ef

Using the pid of the shell, check for seccomp status

grep Seccomp /proc/1/status

Results

Seccomp:        2
Seccomp_filters:

seccomp can be in one of three modes

Mode    Meaning
0       DISABLED
1       STRICT
2       FILTERED
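
To avoid looking the number up each time, the Seccomp line of a status file can be translated to its mode name. This is only a sketch: the function name seccomp_mode_name is hypothetical, and the sample input is the grep output shown earlier rather than a live /proc read.

```shell
# Hypothetical helper: read the text of /proc/<pid>/status on stdin and
# print the name of the seccomp mode found in its "Seccomp:" line.
seccomp_mode_name() {
  awk '$1 == "Seccomp:" {
    name = "UNKNOWN"
    if ($2 == 0) name = "DISABLED"
    if ($2 == 1) name = "STRICT"
    if ($2 == 2) name = "FILTERED"
    print name
  }'
}

# Sample input copied from the grep output above.
mode=$(printf 'Seccomp:\t2\nSeccomp_filters:\n' | seccomp_mode_name)
echo "$mode"
```

Inside the container shell the same function could be fed directly with seccomp_mode_name < /proc/1/status.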

Docker applies a built-in default seccomp filter that blocks about 60 syscalls, including the ptrace syscall

seccomp in Kubernetes

# First let's inspect what is configured using the amicontained container
docker run r.j3ss.co/amicontained amicontained
# 61 syscalls blocked, this is the docker default
# Seccomp is filtering

# Ran this on the test cluster on 9/1/22
k run amicontained --image=r.j3ss.co/amicontained -- amicontained
k logs amicontained
# seccomp is disabled because Kubernetes does not enable it by default
# 23 syscalls blocked

Run the same pod but via a definition file with seccomp enabled

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: amicontained
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  nodeName: lpul-k8stestwrk09
  containers:
  - args:
    - amicontained
    image: r.j3ss.co/amicontained
    name: amicontained
    securityContext:
      allowPrivilegeEscalation: false
  dnsPolicy: ClusterFirst
  restartPolicy: Never

Result:

Container Runtime: kube
Has Namespaces:
        pid: true
        user: false
AppArmor Profile: docker-default (enforce)
Capabilities:
        BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: filtering
Blocked syscalls (63):
        MSGRCV SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT CLOCK_ADJTIME SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD MEMBARRIER PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
Looking for Docker.sock

We can use a custom seccomp profile as well
Create the seccomp profile

# Profiles must be placed under the kubelet's seccomp profile directory, typically /var/lib/kubelet/seccomp, on the node that runs the pod
# Create a profiles directory
mkdir -p /var/lib/kubelet/seccomp/profiles

# Create an audit profile that will log syscalls
echo -e "{\n    \"defaultAction\": \"SCMP_ACT_LOG\"\n}" > /var/lib/kubelet/seccomp/profiles/audit.json
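
Since the likely exam task is copying an existing profile into place, the copy step can be sketched as a small function that also prints the value to use for localhostProfile (a path relative to the kubelet seccomp directory). The function name install_seccomp_profile is hypothetical, and the demo writes to a temporary directory instead of /var/lib/kubelet/seccomp so that it can be run anywhere.

```shell
# Hypothetical helper: copy a profile into <seccomp_root>/profiles and print
# the relative path to use as localhostProfile in the pod spec.
# $1 = path to the profile file
# $2 = seccomp root (assumed default: /var/lib/kubelet/seccomp)
install_seccomp_profile() {
  src=$1
  seccomp_root=${2:-/var/lib/kubelet/seccomp}
  mkdir -p "$seccomp_root/profiles" || return 1
  cp "$src" "$seccomp_root/profiles/" || return 1
  printf 'profiles/%s\n' "$(basename "$src")"
}

# Demo against a throwaway directory so nothing under /var/lib/kubelet changes.
demo_root=$(mktemp -d)
printf '{\n    "defaultAction": "SCMP_ACT_LOG"\n}\n' > "$demo_root/audit.json"
rel=$(install_seccomp_profile "$demo_root/audit.json" "$demo_root/seccomp")
echo "$rel"
```

On a real node the second argument would be omitted and the command run as root, on the node where the pod will be scheduled.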

Create the pod on lpul-k8stestwrk09

apiVersion: v1
kind: Pod
metadata:
  name: test-audit
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # This path is relative to the kubelet's seccomp profile directory, /var/lib/kubelet/seccomp/
      localhostProfile: profiles/audit.json
  nodeName: lpul-k8stestwrk09
  containers:
  - command: ["bash", "-c", "echo 'I just made some syscalls' && sleep 100"]
    image: ubuntu
    name: ubuntu
    securityContext:
      allowPrivilegeEscalation: false
  restartPolicy: Never

View the logs on the node at /var/log/syslog

sudo grep syscall /var/log/syslog

Output

Sep  1 10:39:56 lpul-k8stestwrk09 kernel: [1779317.170920] audit: type=1326 audit(1662053996.821:272): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=801989 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=202 compat=0 ip=0x55ed9c8764f3 code=0x7ffc0000
Sep  1 10:39:56 lpul-k8stestwrk09 kernel: [1779317.170927] audit: type=1326 audit(1662053996.821:273): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=801989 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=39 compat=0 ip=0x55ed9ca1c46b code=0x7ffc0000
Sep  1 10:39:56 lpul-k8stestwrk09 kernel: [1779317.170932] audit: type=1326 audit(1662053996.821:275): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=801989 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=59 compat=0 ip=0x55ed9c8ca93b code=0x7ffc0000
Sep  1 10:39:56 lpul-k8stestwrk09 kernel: [1779317.170937] audit: type=1326 audit(1662053996.821:274): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=801989 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=35 compat=0 ip=0x55ed9c875f5d code=0x7ffc0000
Sep  1 10:39:56 lpul-k8stestwrk09 kernel: [1779317.171044] audit: type=1326 audit(1662053996.821:276): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=801989 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=35 compat=0 ip=0x55ed9c875f5d code=0x7ffc0000
Sep  1 10:39:56 lpul-k8stestwrk09 kernel: [1779317.171161] audit: type=1326 audit(1662053996.821:277): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=801989 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=35 compat=0 ip=0x55ed9c875f5d code=0x7ffc0000
Sep  1 10:39:56 lpul-k8stestwrk09 kernel: [1779317.171253] audit: type=1326 audit(1662053996.821:278): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=801989 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=35 compat=0 ip=0x55ed9c875f5d code=0x7ffc0000
Sep  1 10:39:56 lpul-k8stestwrk09 kernel: [1779317.171342] audit: type=1326 audit(1662053996.821:279): auid=4294967295 uid=0 gid=0 ses=4294967295 

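We must then map the syscall ID above, for example syscall=35, to a name using the unistd_64.h header file
Note: The asm folder did not exist on the test cluster node; on Debian and Ubuntu the header is usually found at /usr/include/x86_64-linux-gnu/asm/unistd_64.h instead

grep -w 35 /usr/include/asm/unistd_64.h

Output

#define __NR_nanosleep 35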
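
Doing this lookup one number at a time gets tedious, so the distinct syscall numbers can be extracted from the audit records first. The following is a sketch only: the function names are hypothetical, the sample records are abridged from the syslog output above, and the number-to-name table covers just the four x86_64 numbers that appear in that sample. A real lookup should still go through unistd_64.h.

```shell
# Print the distinct syscall numbers found in seccomp audit records on stdin.
syscall_numbers() {
  grep -o 'syscall=[0-9][0-9]*' | cut -d= -f2 | sort -n | uniq
}

# Map a few x86_64 syscall numbers to names. Only the numbers present in the
# sample are covered; anything else is reported as unknown.
syscall_name() {
  case $1 in
    35)  echo nanosleep ;;
    39)  echo getpid ;;
    59)  echo execve ;;
    202) echo futex ;;
    *)   echo "unknown($1)" ;;
  esac
}

# Records abridged from the syslog output above.
sample=$(cat <<'EOF'
audit: type=1326 pid=801989 comm="runc:[2:INIT]" arch=c000003e syscall=202 compat=0
audit: type=1326 pid=801989 comm="runc:[2:INIT]" arch=c000003e syscall=39 compat=0
audit: type=1326 pid=801989 comm="runc:[2:INIT]" arch=c000003e syscall=59 compat=0
audit: type=1326 pid=801989 comm="runc:[2:INIT]" arch=c000003e syscall=35 compat=0
audit: type=1326 pid=801989 comm="runc:[2:INIT]" arch=c000003e syscall=35 compat=0
EOF
)

names=$(printf '%s\n' "$sample" | syscall_numbers |
  while read -r n; do syscall_name "$n"; done | paste -sd' ' -)
echo "$names"
```

On the node, sudo grep syscall /var/log/syslog | syscall_numbers would give the full list of numbers to look up.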

An easier way is to capture the syscalls with Tracee in Kubernetes
This will provide the syscalls of all the new containers on the host

docker run --name tracee --rm --privileged --pid=host \
  -v /lib/modules/:/lib/modules/:ro \
  -v /usr/src:/usr/src:ro \
  -v /tmp/tracee:/tmp/tracee \
  aquasec/tracee:0.4.0 --trace container=new

Output:
The pod name is located under UTS_NAME column so we can grep for our pod name in the log
The syscall is listed under the EVENT column

TIME(s)        UTS_NAME         UID    COMM             PID/host        TID/host        RET              EVENT                ARGS
1780188.376384 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                execve               pathname: /usr/bin/bash, argv: [bash -c echo 'I just made some syscalls' && sleep 100]
1780188.376630 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                security_bprm_check  pathname: /usr/bin/bash, dev: 1789, inode: 1192553760
1780188.388974 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                cap_capable          cap: CAP_SYS_ADMIN
1780188.389006 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                cap_capable          cap: CAP_SYS_ADMIN
1780188.389015 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                cap_capable          cap: CAP_SYS_ADMIN
1780188.389023 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                cap_capable          cap: CAP_SYS_ADMIN
1780188.391193 test-audit       0      bash             1      /848117  1      /848117  -2               access               pathname: /etc/ld.so.preload, mode: R_OK
<truncated>
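
Filtering this table by pod name can be scripted. The sketch below assumes the column layout shown above, where the PID/host and TID/host columns each split into two whitespace-separated fields, so EVENT ends up as the 10th field; a COMM value containing spaces would break that assumption. The function name events_for_pod is hypothetical. Not every EVENT is a syscall (cap_capable and security_bprm_check are kernel-internal events), so the list may need trimming before it goes into a profile.

```shell
# Hypothetical helper: list the distinct EVENT values for one pod (UTS_NAME)
# from Tracee table output on stdin.
events_for_pod() {
  awk -v pod="$1" '$2 == pod { print $10 }' | LC_ALL=C sort -u
}

# Rows copied from the Tracee output above.
sample=$(cat <<'EOF'
TIME(s)        UTS_NAME         UID    COMM             PID/host        TID/host        RET              EVENT                ARGS
1780188.376384 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                execve               pathname: /usr/bin/bash, argv: [bash -c echo 'I just made some syscalls' && sleep 100]
1780188.376630 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                security_bprm_check  pathname: /usr/bin/bash, dev: 1789, inode: 1192553760
1780188.388974 test-audit       0      runc:[2:INIT]    1      /848117  1      /848117  0                cap_capable          cap: CAP_SYS_ADMIN
1780188.391193 test-audit       0      bash             1      /848117  1      /848117  -2               access               pathname: /etc/ld.so.preload, mode: R_OK
EOF
)

events=$(printf '%s\n' "$sample" | events_for_pod test-audit | paste -sd' ' -)
echo "$events"
```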

Now let’s try a profile that rejects all syscalls

# Create the violation profile that will deny all syscalls
echo -e "{\n    \"defaultAction\": \"SCMP_ACT_ERRNO\"\n}" > /var/lib/kubelet/seccomp/profiles/violation.json

Create the pod that will use the violation seccomp profile

apiVersion: v1
kind: Pod
metadata:
  name: test-violation
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # This path is relative to the kubelet's seccomp profile directory, /var/lib/kubelet/seccomp/
      localhostProfile: profiles/violation.json
  nodeName: lpul-k8stestwrk09
  containers:
  - command: ["bash", "-c", "echo 'I just made some syscalls' && sleep 100"]
    image: ubuntu
    name: ubuntu
    securityContext:
      allowPrivilegeEscalation: false
  restartPolicy: Never
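
The test-audit and test-violation pods differ only in their name and profile path, so the manifest can be templated. This is a sketch under assumptions: the function name localhost_seccomp_pod is hypothetical, and nodeName is left out, which means the profile would have to exist on whichever node the scheduler picks.

```shell
# Hypothetical helper: print a pod manifest that uses a Localhost seccomp
# profile.
# $1 = pod name
# $2 = profile path relative to the kubelet seccomp directory
localhost_seccomp_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: $2
  containers:
  - command: ["bash", "-c", "echo 'I just made some syscalls' && sleep 100"]
    image: ubuntu
    name: ubuntu
    securityContext:
      allowPrivilegeEscalation: false
  restartPolicy: Never
EOF
}

manifest=$(localhost_seccomp_pod test-violation profiles/violation.json)
printf '%s\n' "$manifest"
```

The output can be piped straight into the cluster, for example localhost_seccomp_pod test-violation profiles/violation.json | kubectl apply -f -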

Listing this pod shows a STATUS of either Error or ContainerCannotRun, because the container is denied even the syscalls it needs to start

Example detailed whitelist seccomp profile
This is a whitelist profile because its default action is to block everything with SCMP_ACT_ERRNO and then whitelist only what we need.
Whitelist profiles are typically the most secure

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "accept4",
                "epoll_wait",
                "pselect6",
                "futex",
                "madvise",
                "epoll_ctl",
                "getsockname",
                "setsockopt",
                "vfork",
                "mmap",
                "read",
                "write",
                "close",
                "arch_prctl",
                "sched_getaffinity",
                "munmap",
                "brk",
                "rt_sigaction",
                "rt_sigprocmask",
                "sigaltstack",
                "gettid",
                "clone",
                "bind",
                "socket",
                "openat",
                "readlinkat",
                "exit_group",
                "epoll_create1",
                "listen",
                "rt_sigreturn",
                "sched_yield",
                "clock_gettime",
                "connect",
                "dup2",
                "epoll_pwait",
                "execve",
                "exit",
                "fcntl",
                "getpid",
                "getuid",
                "ioctl",
                "mprotect",
                "nanosleep",
                "open",
                "poll",
                "recvfrom",
                "sendto",
                "set_tid_address",
                "setitimer",
                "writev"
            ],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}
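
A profile of this shape can also be generated from a list of syscall names, for instance a list gathered with strace or Tracee. The sketch below is illustrative only: make_allowlist_profile is a hypothetical name, only the SCMP_ARCH_X86_64 architecture is emitted, and the three syscalls in the demo are far too few for a real container to start.

```shell
# Hypothetical helper: print a whitelist profile that denies everything by
# default and allows only the syscall names passed as arguments.
make_allowlist_profile() {
  printf '{\n    "defaultAction": "SCMP_ACT_ERRNO",\n'
  printf '    "architectures": ["SCMP_ARCH_X86_64"],\n'
  printf '    "syscalls": [\n        {\n            "names": ['
  sep=''
  for name in "$@"; do
    # Quote each name and separate entries with commas.
    printf '%s"%s"' "$sep" "$name"
    sep=', '
  done
  printf '],\n            "action": "SCMP_ACT_ALLOW"\n        }\n    ]\n}\n'
}

profile=$(make_allowlist_profile read write exit_group)
printf '%s\n' "$profile"
```

Redirecting the output to a file under /var/lib/kubelet/seccomp/profiles/ would make it usable as a Localhost profile.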

The following is a blacklist profile because its default action is to allow everything with SCMP_ACT_ALLOW and then block only the syscalls we list.

{
    "defaultAction": "SCMP_ACT_ALLOW",
    "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
    ],
    "syscalls": [
        {
            "names": [
                "socket",
                "bind",
                "listen",
                "accept",
                "accept4",
                "connect",
                "shutdown",
                "setsockopt",
                "getsockopt"
            ],
            "action": "SCMP_ACT_ERRNO"
        }
    ]
}