Linux Capabilities
What Are Linux Capabilities?
Traditionally, Linux divided processes into two categories:
- Privileged (root, UID 0): Can do everything -- bypass all kernel permission checks
- Unprivileged (all other users): Subject to full permission checking
This binary model is too coarse-grained. A web server needs to bind to port 80 (a privileged operation), but it does not need to load kernel modules or reboot the system.
Linux Capabilities decompose the monolithic "root privilege" into distinct units that can be independently granted or revoked. Instead of giving a process full root access, you grant only the specific capabilities it needs.
Key Insight
Capabilities change the question from "does this process run as root?" to "which specific privileges does this process actually need?" This is the foundation of least-privilege security in containers.
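One way to picture this decomposition is as a set of independent bit flags. The sketch below is a simplified Python model, not kernel code: `Cap` and `web_server` are illustrative names, and only four capabilities are modelled (their bit positions follow linux/capability.h).

```python
from enum import IntFlag


class Cap(IntFlag):
    """A handful of capabilities; bit positions follow <linux/capability.h>."""
    CHOWN = 1 << 0
    NET_BIND_SERVICE = 1 << 10
    SYS_MODULE = 1 << 16
    SYS_BOOT = 1 << 22


# A web server that only has to bind to port 80 needs a single flag.
web_server = Cap.NET_BIND_SERVICE

# Granting and revoking act on individual flags, not on "root" as a whole.
with_chown = web_server | Cap.CHOWN
without_chown = with_chown & ~Cap.CHOWN

print(Cap.NET_BIND_SERVICE in web_server)  # True
print(Cap.SYS_MODULE in web_server)        # False
print(without_chown == web_server)         # True
```

The web server in this model can bind to a privileged port but has no way to load kernel modules or reboot the system, which is the separation the binary root model cannot express.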
Capabilities vs Running as Root
Linux Capability Categories
All Linux Capabilities
| Capability | Purpose | Risk Level |
|---|---|---|
| CAP_AUDIT_WRITE | Write to the kernel audit log | Low |
| CAP_CHOWN | Change file ownership | Medium |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | High |
| CAP_DAC_READ_SEARCH | Bypass file read and directory search permissions | High |
| CAP_FOWNER | Bypass permission checks on operations requiring file ownership | Medium |
| CAP_FSETID | Keep setuid/setgid bits when a file is modified | Medium |
| CAP_KILL | Send signals to any process | Medium |
| CAP_MKNOD | Create special device files | Medium |
| CAP_NET_ADMIN | Network configuration (interfaces, routing, firewall) | High |
| CAP_NET_BIND_SERVICE | Bind to privileged ports (<1024) | Low |
| CAP_NET_RAW | Use raw and packet sockets (ping, ARP) | Medium |
| CAP_SETFCAP | Set file capabilities | High |
| CAP_SETGID | Manipulate process GID | Medium |
| CAP_SETPCAP | Modify process capabilities | High |
| CAP_SETUID | Manipulate process UID | Medium |
| CAP_SYS_ADMIN | Broad admin ops: mount, namespace, syslog, etc. | Critical |
| CAP_SYS_BOOT | Reboot the system | High |
| CAP_SYS_CHROOT | Use chroot() | Medium |
| CAP_SYS_MODULE | Load/unload kernel modules | Critical |
| CAP_SYS_NICE | Set process scheduling priority | Low |
| CAP_SYS_PTRACE | Trace arbitrary processes (debug/inspect) | Critical |
| CAP_SYS_RAWIO | Raw I/O port access | Critical |
| CAP_SYS_RESOURCE | Override resource limits | Medium |
| CAP_SYS_TIME | Set system clock | Medium |
| CAP_SYSLOG | Perform privileged syslog(2) operations | Low |
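Each capability corresponds to one bit in a 64-bit mask, which is how the kernel reports capability sets under /proc. As a rough illustration, the following Python sketch maps bit positions to names. The list order follows linux/capability.h (bits 0 to 40 on recent kernels), and `decode_caps` is a hypothetical helper name, not part of any library.

```python
# Capability names indexed by bit position, as defined in <linux/capability.h>.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask: int) -> list[str]:
    """Return the names of all capability bits set in mask."""
    names = []
    for bit, name in enumerate(CAP_NAMES):
        if mask & (1 << bit):
            names.append(name)
    return names


print(decode_caps(0x400))     # ['cap_net_bind_service']
print(decode_caps(0x210000))  # ['cap_sys_module', 'cap_sys_admin']
```

The table above covers the capabilities most relevant to containers; the list in the sketch also includes the remaining ones so that the bit positions line up.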
Default Capabilities in Containers
By default, the container runtime (Docker/containerd) grants a limited set of capabilities to containers. These are a subset of root capabilities, chosen to allow most applications to function without full root:
Default Container Capabilities
| Capability | Why It's Included |
|---|---|
| CAP_AUDIT_WRITE | Writing audit logs |
| CAP_CHOWN | Changing file ownership during init |
| CAP_DAC_OVERRIDE | Reading, writing, and executing files regardless of permissions |
| CAP_FOWNER | Operating on files owned by others |
| CAP_FSETID | Keeping setuid/setgid bits when files are modified |
| CAP_KILL | Sending signals to processes owned by other users |
| CAP_MKNOD | Creating device files |
| CAP_NET_BIND_SERVICE | Binding to ports below 1024 |
| CAP_NET_RAW | Raw network sockets (ping) |
| CAP_SETFCAP | Setting file capabilities |
| CAP_SETGID | Switching GID |
| CAP_SETPCAP | Modifying capability sets |
| CAP_SETUID | Switching UID |
| CAP_SYS_CHROOT | Using chroot |
Even Defaults Are Too Permissive
The default set includes capabilities like NET_RAW (allows ARP spoofing) and DAC_OVERRIDE (bypasses file permissions). For hardened workloads, you should drop ALL and add back only what's needed.
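To see what the defaults amount to at the kernel level, the sketch below rebuilds the bitmask for the 14 default capabilities and then removes the two called out here. The bit positions come from linux/capability.h; `DEFAULT_BITS` and `mask_of` are illustrative names.

```python
# Bit positions (from <linux/capability.h>) of the 14 default capabilities.
DEFAULT_BITS = {
    "CHOWN": 0, "DAC_OVERRIDE": 1, "FOWNER": 3, "FSETID": 4, "KILL": 5,
    "SETGID": 6, "SETUID": 7, "SETPCAP": 8, "NET_BIND_SERVICE": 10,
    "NET_RAW": 13, "SYS_CHROOT": 18, "MKNOD": 27, "AUDIT_WRITE": 29,
    "SETFCAP": 31,
}


def mask_of(names):
    """Combine the named capabilities into a single integer bitmask."""
    mask = 0
    for name in names:
        mask |= 1 << DEFAULT_BITS[name]
    return mask


default_mask = mask_of(DEFAULT_BITS)
hardened = mask_of(n for n in DEFAULT_BITS if n not in {"NET_RAW", "DAC_OVERRIDE"})

print(f"{default_mask:016x}")  # 00000000a80425fb
print(f"{hardened:016x}")      # 00000000a80405f9
```

The first value is the mask that shows up in the CapPrm and CapEff lines of a default container running as root.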
Viewing Current Capabilities
# Inside a container -- check the capability sets of PID 1
grep Cap /proc/1/status

# Decode a capability hex value (capsh is part of the libcap tools)
capsh --decode=00000000a80425fb

# On the host -- list capabilities explicitly added to or dropped from a container
# (the runtime defaults are not shown here)
docker inspect --format '{{.HostConfig.CapAdd}}' <container>
docker inspect --format '{{.HostConfig.CapDrop}}' <container>

Dangerous Capabilities
Some capabilities are especially dangerous and should almost never be granted to containers:
CAP_SYS_ADMIN -- The "New Root"
CAP_SYS_ADMIN
SYS_ADMIN is the most dangerous capability. It grants a vast collection of administrative powers:
- Mount/unmount filesystems
- Perform clone() with new namespaces
- Use setns() to join namespaces
- Configure kernel parameters via sysctl()
- Perform operations on extended attributes
- And many more...
Granting SYS_ADMIN to a container is almost equivalent to running it as privileged. It is one of the most common container escape vectors.
CAP_NET_ADMIN
Allows:
- Modify routing tables
- Configure network interfaces
- Modify firewall rules (iptables)
- Set network QoS parameters
- Configure network bridging
Risk: Network-level attacks, ARP poisoning, traffic interception
CAP_SYS_PTRACE
Allows:
- Trace any process using ptrace()
- Read and modify process memory
- Inject code into running processes
- Bypass seccomp filters on traced processes (on kernels before 4.8)
Risk: Container escape by tracing host processes, credential theft
CAP_SYS_MODULE
Allows:
- Load arbitrary kernel modules
- Unload kernel modules
Risk: Full kernel compromise by loading malicious modules
Configuring Capabilities in Kubernetes
Capabilities are managed through the securityContext.capabilities field in a container spec.
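How drop and add combine can be modelled in a few lines. The following Python sketch is a simplified model of the usual runtime behaviour (start from the defaults, apply drop, then apply add, with ALL in drop clearing the set). Real runtimes may differ in edge cases, `add: [ALL]` is not modelled, and `resolve_caps` and `DEFAULT_CAPS` are hypothetical names introduced here.

```python
DEFAULT_CAPS = {
    "AUDIT_WRITE", "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL",
    "MKNOD", "NET_BIND_SERVICE", "NET_RAW", "SETFCAP", "SETGID",
    "SETPCAP", "SETUID", "SYS_CHROOT",
}


def resolve_caps(drop=(), add=(), defaults=DEFAULT_CAPS):
    """Model the capability set a container ends up with (simplified)."""
    if "ALL" in drop:
        caps = set()                      # drop: [ALL] clears the defaults
    else:
        caps = set(defaults) - set(drop)  # otherwise remove only the named caps
    return caps | set(add)                # add is applied after drop


print(sorted(resolve_caps(drop=["ALL"], add=["NET_BIND_SERVICE"])))
# ['NET_BIND_SERVICE']
print(len(resolve_caps(drop=["NET_RAW", "SYS_ADMIN"])))
# 13
```

The second call shows that dropping a capability that is not in the default set (SYS_ADMIN here) changes nothing; only NET_RAW is actually removed.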
Dropping ALL Capabilities
The most secure starting point -- drop everything and add back only what's needed:
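apiVersion: v1
kind: Pod
metadata:
  name: minimal-caps
spec:
  containers:
  - name: app
    image: nginx:1.27
    securityContext:
      capabilities:
        drop:
        - ALL
        add:
        # The stock nginx image usually also needs CHOWN, SETGID, and SETUID;
        # see the web server configuration below.
        - NET_BIND_SERVICE

Dropping Specific Dangerous Capabilities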
If dropping ALL is too restrictive for your application:
apiVersion: v1
kind: Pod
metadata:
  name: safer-pod
spec:
  containers:
  - name: app
    image: nginx:1.27
    securityContext:
      capabilities:
        # NET_RAW is the only one of these in the default set; the others
        # are not granted unless explicitly added.
        drop:
        - SYS_ADMIN
        - NET_ADMIN
        - SYS_PTRACE
        - NET_RAW
        - SYS_MODULE

Common Capability Configurations by Workload Type
Web Server (nginx, Apache)
securityContext:
  capabilities:
    drop:
    - ALL
    add:
    - NET_BIND_SERVICE  # Bind to port 80/443
    - CHOWN             # Set file ownership
    - SETGID            # Switch group
    - SETUID            # Switch user (worker process)

Application Container (Node.js, Python, Java)
securityContext:
  capabilities:
    drop:
    - ALL
    # No capabilities needed for most apps on non-privileged ports

Network Tool / Debug Container
securityContext:
  capabilities:
    drop:
    - ALL
    add:
    - NET_RAW    # ping, traceroute
    - NET_ADMIN  # network configuration (use cautiously)

Capability Hierarchy and Inheritance
Complete Examples
Example 1: Hardened Nginx Pod
apiVersion: v1
kind: Pod
metadata:
  name: nginx-hardened
spec:
  containers:
  - name: nginx
    image: nginx:1.27
    ports:
    - containerPort: 80
    securityContext:
      runAsNonRoot: false  # nginx master needs root initially
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE  # Port 80
        - CHOWN             # File ownership
        - SETGID            # Worker process GID
        - SETUID            # Worker process UID
        - DAC_OVERRIDE      # Access config files
    volumeMounts:
    - name: cache
      mountPath: /var/cache/nginx
    - name: run
      mountPath: /var/run
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}
  - name: tmp
    emptyDir: {}

Example 2: Minimal Application Pod
apiVersion: v1
kind: Pod
metadata:
  name: app-minimal
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    runAsNonRoot: true
  containers:
  - name: app
    image: python:3.12-slim
    command: ["python", "-m", "http.server", "8080"]
    ports:
    - containerPort: 8080
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
    volumeMounts:
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}

Example 3: Deployment with Capability Restrictions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: secure-api
  template:
    metadata:
      labels:
        app: secure-api
    spec:
      securityContext:
        runAsUser: 10001
        runAsGroup: 10001
        runAsNonRoot: true
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        image: myapp:latest
        ports:
        - containerPort: 8080
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        resources:
          limits:
            memory: "256Mi"
            cpu: "500m"
          requests:
            memory: "128Mi"
            cpu: "250m"

Verifying Capabilities
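The Cap* fields inspected in this section are hexadecimal bitmasks with one bit per capability. As a small illustration, this Python sketch parses saved status output and tests individual bits. `parse_cap_sets` and `has_cap` are hypothetical helpers, the sample text mirrors a default container running as root, and the bit numbers are taken from linux/capability.h.

```python
SAMPLE_STATUS = """\
CapInh:\t0000000000000000
CapPrm:\t00000000a80425fb
CapEff:\t00000000a80425fb
CapBnd:\t00000000a80425fb
CapAmb:\t0000000000000000
"""

CAP_NET_BIND_SERVICE = 10  # bit positions from <linux/capability.h>
CAP_NET_RAW = 13
CAP_SYS_ADMIN = 21


def parse_cap_sets(status_text):
    """Map each Cap* field name to its integer bitmask."""
    sets = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            field, value = line.split(":", 1)
            sets[field] = int(value.strip(), 16)
    return sets


def has_cap(mask, bit):
    """True if the capability with the given bit number is set in mask."""
    return bool(mask & (1 << bit))


effective = parse_cap_sets(SAMPLE_STATUS)["CapEff"]
print(has_cap(effective, CAP_NET_RAW))    # True
print(has_cap(effective, CAP_SYS_ADMIN))  # False
```

CapEff is the set the kernel actually checks for the running process, so it is usually the field to test first; CapBnd shows the upper limit the process could ever regain.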
# Check capabilities of a running container
kubectl exec <pod> -- cat /proc/1/status | grep Cap
# Example output:
# CapInh: 0000000000000000
# CapPrm: 00000000a80425fb
# CapEff: 00000000a80425fb
# CapBnd: 00000000a80425fb
# CapAmb: 0000000000000000

# Decode the hex value
# On the host (or any machine with capsh installed):
capsh --decode=00000000a80425fb
# Output:
# 0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,...

# Check if a specific capability is present
kubectl exec <pod> -- grep Cap /proc/1/status
# For a pod with drop ALL + add NET_BIND_SERVICE (process running as root):
# CapPrm: 0000000000000400
# capsh --decode=0000000000000400
# 0x0000000000000400=cap_net_bind_service

Best Practices for Minimal Capabilities
Capability Hardening Checklist
- Always start with drop: ALL -- then add back only what's needed
- Never add SYS_ADMIN -- it's nearly equivalent to running privileged
- Avoid NET_RAW unless the app genuinely needs raw sockets (ping)
- Set allowPrivilegeEscalation: false -- prevents gaining new capabilities via setuid binaries
- Run as non-root (runAsNonRoot: true) -- most apps don't need root
- Use readOnlyRootFilesystem: true -- prevents filesystem modification
- Test incrementally -- add one capability at a time until the app works
- Document why each added capability is needed
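Most items on this checklist can be checked mechanically. The following is a minimal Python sketch, assuming the container's securityContext has already been loaded into a plain dict (for example from a parsed manifest); `check_container` is a hypothetical helper, and it looks only at container-level fields, so a runAsNonRoot set at pod level would be reported as missing.

```python
def check_container(security_context):
    """Return a list of checklist violations for one container's securityContext."""
    problems = []
    caps = security_context.get("capabilities", {})
    drop = caps.get("drop", [])
    add = caps.get("add", [])

    if "ALL" not in drop:
        problems.append("capabilities.drop does not include ALL")
    if "SYS_ADMIN" in add:
        problems.append("SYS_ADMIN is added")
    if "NET_RAW" in add:
        problems.append("NET_RAW is added; confirm raw sockets are needed")
    if security_context.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation is not false")
    if security_context.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot is not true")
    if security_context.get("readOnlyRootFilesystem") is not True:
        problems.append("readOnlyRootFilesystem is not true")
    return problems


hardened = {
    "allowPrivilegeEscalation": False,
    "runAsNonRoot": True,
    "readOnlyRootFilesystem": True,
    "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
}
print(check_container(hardened))  # []

for problem in check_container({"capabilities": {"add": ["SYS_ADMIN"]}}):
    print("-", problem)
```

The last two checklist items (testing incrementally and documenting each capability) are process steps and are left out of the sketch.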
How to Determine Required Capabilities
If you are unsure which capabilities your application needs:
- Start with drop: ALL and no additions
- Run the pod and check if it works
- If it fails, check the error messages:
  - "Permission denied" on bind() -> needs NET_BIND_SERVICE
  - "Operation not permitted" on chown() -> needs CHOWN
  - "Operation not permitted" on raw socket -> needs NET_RAW
- Add the minimum capability needed and repeat
- Alternatively, use strace or auditd to trace what operations fail
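The error-to-capability hints in these steps can be kept as a small lookup table while iterating. This Python sketch is only an illustration: `HINTS` and `suggest_capability` are hypothetical names, the operation labels are made up for the example, and the errno values are the ones whose messages read "Permission denied" (EACCES) and "Operation not permitted" (EPERM).

```python
import errno

# (operation, errno) pairs for the three failures listed in the steps.
HINTS = {
    ("bind", errno.EACCES): "NET_BIND_SERVICE",
    ("chown", errno.EPERM): "CHOWN",
    ("raw_socket", errno.EPERM): "NET_RAW",
}


def suggest_capability(operation, error_number):
    """Return the capability that usually resolves the failure, or None."""
    return HINTS.get((operation, error_number))


print(suggest_capability("bind", errno.EACCES))  # NET_BIND_SERVICE
print(suggest_capability("chown", errno.EPERM))  # CHOWN
print(suggest_capability("mount", errno.EPERM))  # None
```

A None result is a cue to investigate with strace or auditd rather than to reach for a broad capability such as SYS_ADMIN.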
Quick Reference
Exam Speed Reference
Drop ALL and add specific:
securityContext:
  capabilities:
    drop:
    - ALL
    add:
    - NET_BIND_SERVICE

Key facts:
- Capabilities use uppercase names without the CAP_ prefix in Kubernetes
- drop and add are lists of strings
- drop: ["ALL"] drops all capabilities
- Capabilities are container-level, not pod-level
- allowPrivilegeEscalation: false prevents gaining new capabilities
- The ALL keyword is special -- it means all capabilities
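Because capsh prints lowercase names with the cap_ prefix while Kubernetes expects uppercase names without it, converting between the two is a common small chore. A minimal Python sketch follows; `to_kubernetes_name` and `to_kubernetes_list` are hypothetical helpers, and the input format assumed is the `mask=name,name,...` line that capsh --decode prints.

```python
def to_kubernetes_name(name):
    """Convert e.g. 'cap_net_bind_service' to 'NET_BIND_SERVICE'."""
    upper = name.strip().upper()
    return upper[4:] if upper.startswith("CAP_") else upper


def to_kubernetes_list(capsh_output):
    """Turn the right-hand side of capsh --decode output into a list of names."""
    _, _, names = capsh_output.partition("=")
    return [to_kubernetes_name(n) for n in names.split(",") if n.strip()]


print(to_kubernetes_name("CAP_SYS_ADMIN"))
# SYS_ADMIN
print(to_kubernetes_list("0x0000000000000400=cap_net_bind_service"))
# ['NET_BIND_SERVICE']
```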
Key Exam Takeaways
- Always drop ALL capabilities and add back only what's needed
- Capabilities in Kubernetes use uppercase names without the CAP_ prefix
- SYS_ADMIN is essentially root -- never grant it
- Set allowPrivilegeEscalation: false alongside capability restrictions
- Capabilities are set at the container level, not pod level
- Default containers get ~14 capabilities -- far more than most apps need
- Combine capability restrictions with runAsNonRoot and readOnlyRootFilesystem