As preconditions to this post, the following steps can always be taken to make sure a K8s cluster fulfills basic security requirements:
- disable anonymous and insecure access to the k8s api - see also here
- allow for the analysis of API access by enabling K8s audit logs
- enable etcd encryption at rest, so that secrets and other k8s resources are not stored in plain text on the nodes hosting etcd;
- implement network policies to restrict access across pods
- reduce the attack surface by making sure containers only run necessary processes - namely by starting from minimal images and using multi-stage builds - see this for a full overview
- check the cluster against the CIS (Center for Internet Security) benchmarks, which define best practices for the configuration of various systems, including k8s; kube-bench is one tool that automates these checks.
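As an illustration of the network policy item above, here is a minimal sketch of a "default deny" policy, built as a Python dictionary and printed as JSON. The function name and the "demo" namespace are placeholders chosen for this example; the manifest fields follow the networking.k8s.io/v1 NetworkPolicy schema.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod in `namespace`
    and lists no allowed ingress or egress rules."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty selector matches all pods in the namespace.
            "podSelector": {},
            # Both directions are listed with no rules, so all traffic is denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


# "demo" is a placeholder namespace used only for this example.
policy = default_deny_policy("demo")
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, after which more specific policies can re-open only the traffic that is actually needed. Note that policies only take effect if the cluster's network plugin supports them.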
But let's get to the main point of this post.
Container isolation is based on two concepts: i) namespaces - restricting what processes see in terms of file system, users and other processes, and ii) cgroups - restricting the actual availability of resources to processes (e.g., in terms of memory, CPU and disk I/O).
Containers, like other user applications, run in user space and use system calls to communicate with the kernel. Kernel hardening restricts which system calls are possible, adding a layer between the applications and libraries running in user space and the syscall interface, and consequently the kernel.
A classic tool to identify and log syscalls made by a process is strace. It can be used directly on a started command or attached to a specific process id ("-p" followed by the pid). The count and summarize mode ("-cw") groups system calls by type and reports how often each one was called, along with the wall-clock time spent in it.
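To make the two concepts tangible, the following sketch reads the namespace and cgroup membership of a process from /proc on a Linux host. The helper names are made up for this example, and both functions return empty results on systems where /proc is not available.

```python
import os


def list_namespaces(pid: str = "self") -> dict:
    """Map each namespace type (pid, net, mnt, ...) to its identifier.

    Two processes with the same identifier share that namespace."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):
        return result  # not on Linux, or the process does not exist
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink such as "pid:[4026531836]".
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            result[name] = "unreadable"
    return result


def list_cgroups(pid: str = "self") -> list:
    """Return the cgroup membership lines of a process."""
    try:
        with open(f"/proc/{pid}/cgroup") as handle:
            return [line.strip() for line in handle if line.strip()]
    except OSError:
        return []


namespaces = list_namespaces()
cgroups = list_cgroups()
for name, ident in namespaces.items():
    print(f"{name:20s} {ident}")
for line in cgroups:
    print(line)
```

Running it once on the host and once inside a container should show different namespace identifiers and a different cgroup path for the containerized process.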
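The summary printed by strace in count mode is a plain table, so it can be turned into a list of the syscalls a process actually uses. Below is a minimal parser sketch; the SAMPLE text is a hand-written illustration of the table layout, not a real measurement, and the function name is an assumption of this example.

```python
# Hand-written sample mimicking the layout of an "strace -c" summary.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 52.10    0.000124          13         9           mmap
 21.85    0.000052          17         3         1 openat
 26.05    0.000062          20         3           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.000238                    15         1 total
"""


def parse_strace_summary(text: str) -> dict:
    """Return {syscall name: number of calls} from an strace summary table."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip short lines and the final "total" row.
        if len(fields) < 5 or fields[-1] == "total":
            continue
        try:
            float(fields[0])        # "% time" column; fails on header/separator rows
            calls = int(fields[3])  # "calls" column
        except ValueError:
            continue
        counts[fields[-1]] = calls  # the syscall name is always the last column
    return counts


summary = parse_strace_summary(SAMPLE)
print(summary)
```

The resulting set of syscall names is a useful starting point for an allow-list, although a single traced run rarely exercises every code path of an application.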
To monitor and restrict access to the syscall interface:
- Sysdig Falco - is a kernel tracing tool useful to track access to system calls and kernel activity in order to detect unexpected and malicious behavior. It can be installed either as a standalone process on the cluster nodes or as a k8s daemonset. Falco ships with default rules and also supports user-defined ones. Rather than blocking anything, it logs the events that violate those rules, so that the log can be further processed. Please have a look at this blog post for a tutorial of Falco on K8s.
- AppArmor - is commonly employed on the worker nodes to confine what a process is allowed to do. It works through profiles, i.e. policies, which are built around normal app usage, i.e. by tracking which files, capabilities and network access a certain process needs to operate. This way, should anomalous usage occur, it can be rejected.
- seccomp - is a Linux kernel security mode that filters the system calls themselves. In its original strict mode, a process entering that state can only call read(), write(), _exit() and sigreturn(); a call to any other syscall terminates the process with a SIGKILL. In filter mode (seccomp-bpf), which is what container runtimes use, a profile lists the allowed syscalls, and any other call fails with an error or gets the process killed with a SIGKILL or SIGSYS. This way, should the application be exploited for any reason, any call outside its profile is blocked.
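The seccomp allow-list idea can be sketched as a profile in the JSON format understood by container runtimes: everything is denied by default and only the listed syscalls are allowed. The function name and the tiny allow-list below are illustrative assumptions; a real workload needs many more syscalls, and the architecture entry has to match the nodes.

```python
import json


def build_seccomp_profile(allowed_syscalls) -> dict:
    """Build a deny-by-default seccomp profile from a list of syscall names."""
    return {
        # Any syscall not listed below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        # Assumption: the nodes are x86_64 machines.
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Illustrative allow-list only; far too small for a real application.
profile = build_seccomp_profile(["read", "write", "exit", "rt_sigreturn", "read"])
print(json.dumps(profile, indent=2))
```

On K8s, a profile like this is typically placed in the kubelet's seccomp directory on each node and referenced from the pod's securityContext through a seccompProfile of type Localhost; using type RuntimeDefault instead applies the container runtime's built-in profile.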
Other possibilities for kernel hardening are:
- gVisor - provides an application kernel that implements the Linux system call interface in user space and exposes an OCI runtime called runsc, so that an isolation layer can be placed between user applications and the actual host kernel.
- Kata Containers - provides a container runtime that runs containers inside lightweight virtual machines, each with its own guest kernel, in order to provide much stronger isolation from the underlying host kernel.
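On K8s, sandboxed runtimes like these are usually selected per pod through a RuntimeClass. The sketch below builds both manifests as Python dictionaries; the helper names, the "gvisor" class name and the nginx image are placeholders, and the "runsc" handler is assumed to match what is configured in the container runtime on the nodes.

```python
import json


def runtime_class(name: str, handler: str) -> dict:
    """Build a RuntimeClass that maps `name` to a runtime handler on the nodes."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        # Must match a handler configured in the node's container runtime.
        "handler": handler,
    }


def sandboxed_pod(pod_name: str, image: str, runtime_class_name: str) -> dict:
    """Build a Pod that asks to be run by the given RuntimeClass."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": pod_name, "image": image}],
        },
    }


# Placeholder names: "gvisor" for the class, "runsc" as the assumed handler.
gvisor_class = runtime_class("gvisor", "runsc")
pod = sandboxed_pod("sandboxed-nginx", "nginx", "gvisor")
print(json.dumps([gvisor_class, pod], indent=2))
```

The same pattern applies to Kata Containers, with the handler name set to whatever the Kata installation registered on the nodes.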