Containers revolutionized application deployment, but they introduced new security challenges. Unlike virtual machines, which a hypervisor isolates at the hardware level, containers share the host kernel, so a compromised container has a larger attack surface to work with. After securing production container environments running thousands of workloads, I've learned which security practices matter most and how to implement them effectively.
Understanding Container Security Layers
Container security is multi-layered:
┌─────────────────────────────────────┐
│ Application Code & Dependencies     │ ← Application Security
├─────────────────────────────────────┤
│ Container Image                     │ ← Image Security
├─────────────────────────────────────┤
│ Container Runtime                   │ ← Runtime Security
├─────────────────────────────────────┤
│ Host Operating System               │ ← Host Security
├─────────────────────────────────────┤
│ Orchestration Platform              │ ← Kubernetes Security
└─────────────────────────────────────┘
This post focuses on the runtime layer: how containers execute and how to secure that execution.
Linux Namespaces: The Foundation
Containers use Linux namespaces for isolation. Understanding them is crucial:
PID Namespace: Process isolation
# See processes in container
docker exec mycontainer ps aux
# From host, see real PIDs
ps aux | grep container-process
Network Namespace: Network stack isolation
# Container has its own network stack
docker exec mycontainer ip addr
# Host can see all network namespaces
# (Docker does not register its namespaces with `ip netns`, so list them with lsns)
sudo lsns -t net
Mount Namespace: Filesystem isolation
# Container sees only its filesystem
docker exec mycontainer df -h
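# Mounts created inside the container do not propagate to the host
User Namespace: User ID isolation (most critical for security)
# Run the container process as UID/GID 1000 instead of root
docker run --rm --user 1000:1000 myimage id
# Note: --user changes the UID but does not create a user namespace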
The User Namespace Problem
By default, Docker does not remap user IDs, so root (UID 0) in a container is also UID 0 on the host:
# Dockerfile
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y vim
USER root
CMD ["/bin/bash"]
# This container runs as root
docker run -it myimage
# If an attacker escapes the container, they have root on the host
Solution: Use user namespaces
Enable user namespace remapping:
// /etc/docker/daemon.json
{
  "userns-remap": "default"
}
sudo systemctl restart docker
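# Container root now maps to an unprivileged subordinate UID on the host
docker run --rm ubuntu whoami                # Prints 'root' inside the container
docker run -d --rm ubuntu sleep 300
ps -eo user,pid,args | grep '[s]leep 300'    # Host shows a high UID from /etc/subuid (often 100000)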
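The mapping behind this can be inspected in /proc/<pid>/uid_map, where each row lists a starting ID inside the namespace, the starting ID outside it, and the range length. The following is a minimal Go sketch that parses that format and checks whether container root is host root. The function names and the sample contents are illustrative assumptions, not part of Docker.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// uidMapping is one row of a uid_map file: Length IDs starting at InsideStart
// in the namespace correspond to IDs starting at OutsideStart outside it.
type uidMapping struct {
	InsideStart, OutsideStart, Length uint64
}

// parseUIDMap parses the whitespace-separated, three-column uid_map format.
func parseUIDMap(content string) ([]uidMapping, error) {
	var mappings []uidMapping
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("expected 3 fields, got %d in %q", len(fields), line)
		}
		var nums [3]uint64
		for i, f := range fields {
			n, err := strconv.ParseUint(f, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("bad number %q: %w", f, err)
			}
			nums[i] = n
		}
		mappings = append(mappings, uidMapping{nums[0], nums[1], nums[2]})
	}
	return mappings, nil
}

// rootIsHostRoot reports whether UID 0 inside the namespace is UID 0 outside it.
func rootIsHostRoot(mappings []uidMapping) bool {
	for _, m := range mappings {
		if m.InsideStart == 0 && m.Length > 0 {
			return m.OutsideStart == 0
		}
	}
	return false // UID 0 is not mapped at all
}

func main() {
	// Sample file contents (assumed values), not read from a live system
	samples := []struct{ name, content string }{
		{"no remapping", "         0          0 4294967295\n"},
		{"userns-remap", "         0     100000      65536\n"},
	}
	for _, s := range samples {
		mappings, err := parseUIDMap(s.content)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s: container root is host root = %v\n", s.name, rootIsHostRoot(mappings))
	}
}
```

On a real host, the same parser could be fed the contents of /proc/<pid>/uid_map for a container's main process.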
Better: Run as non-root explicitly
FROM ubuntu:20.04
# Create app user
RUN groupadd -r appuser && useradd -r -g appuser appuser
# Install dependencies as root
RUN apt-get update && apt-get install -y myapp
# Switch to non-root
USER appuser
CMD ["/usr/bin/myapp"]
Capabilities: Fine-Grained Privileges
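Linux capabilities split root's privileges into distinct units that can be granted or dropped independently. Drop the ones your workload does not need:
# Default Docker drops some capabilities but retains many
docker run --rm -it ubuntu sh -c 'cat /proc/self/status | grep Cap'
# Drop all, add only what's needed
# Note: images that start as root and then switch users (the official nginx
# image, for example) may need more, such as CHOWN, SETGID, and SETUID
docker run --rm -it \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  nginx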
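The Cap lines in /proc/self/status are hexadecimal bitmasks, which are hard to read by eye. As a rough sketch, the Go program below decodes such a mask into capability names using the bit positions from linux/capability.h. The decodeCaps helper is an illustrative name, and the first sample mask is the value commonly reported for a default Docker container, assumed here rather than measured.

```go
package main

import (
	"fmt"
	"strconv"
)

// capNames lists capabilities by bit position (bit 0 first),
// following the numbering in linux/capability.h.
var capNames = []string{
	"CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID",
	"KILL", "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE",
	"NET_BIND_SERVICE", "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK",
	"IPC_OWNER", "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE",
	"SYS_PACCT", "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE",
	"SYS_TIME", "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE",
	"AUDIT_CONTROL", "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG",
	"WAKE_ALARM", "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF",
	"CHECKPOINT_RESTORE",
}

// decodeCaps turns a hex mask such as a CapEff value into capability names.
// Bits beyond the table are ignored.
func decodeCaps(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, fmt.Errorf("invalid capability mask %q: %w", hexMask, err)
	}
	var names []string
	for bit, name := range capNames {
		if mask&(uint64(1)<<uint(bit)) != 0 {
			names = append(names, name)
		}
	}
	return names, nil
}

func main() {
	// Assumed sample masks: a typical default set, and NET_BIND_SERVICE only
	for _, mask := range []string{"00000000a80425fb", "0000000000000400"} {
		names, err := decodeCaps(mask)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s -> %d capabilities: %v\n", mask, len(names), names)
	}
}
```

Comparing the decoded set before and after --cap-drop=ALL is a quick way to confirm that the flags took effect.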
In Kubernetes:
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE
      readOnlyRootFilesystem: true
Read-Only Root Filesystem
Prevent runtime modifications:
apiVersion: v1
kind: Pod
metadata:
  name: readonly-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: var-run
      mountPath: /var/run
  volumes:
  - name: tmp
    emptyDir: {}
  - name: var-run
    emptyDir: {}
Application code:
package main

import (
	"log"
	"os"
)

func main() {
	// This fails with a read-only root filesystem:
	// err := os.WriteFile("/app/data.txt", []byte("test"), 0644)

	// This works because /tmp is a mounted emptyDir volume
	if err := os.WriteFile("/tmp/data.txt", []byte("test"), 0644); err != nil {
		log.Fatal(err)
	}

	// For caching, use /tmp or another emptyDir volume
	cacheDir := os.Getenv("CACHE_DIR")
	if cacheDir == "" {
		cacheDir = "/tmp/cache"
	}
	if err := os.MkdirAll(cacheDir, 0755); err != nil {
		log.Fatal(err)
	}
}
Seccomp Profiles
Seccomp (Secure Computing Mode) restricts which system calls a container can make. The allowlist below is illustrative: every syscall not listed fails with an error, and most real workloads need more than this, so build the list from the syscalls your application actually makes:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "accept", "accept4", "access", "arch_prctl", "bind", "brk",
        "clone", "close", "connect", "dup", "dup2",
        "epoll_create", "epoll_ctl", "epoll_wait",
        "execve", "exit", "exit_group", "fstat", "futex",
        "getcwd", "getpid", "getppid", "listen",
        "mmap", "mprotect", "munmap", "open", "openat", "read",
        "rt_sigaction", "rt_sigreturn", "socket", "stat", "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
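Before shipping an allowlist like this, it helps to compare it against the syscalls the application is known to need. The Go sketch below loads a profile in this JSON shape and reports which required syscalls it would block. The missingSyscalls helper and the required list are hypothetical, and the check ignores argument filters and architecture rules, so treat it as a first-pass sanity check only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// seccompProfile models only the fields of the profile that this check uses.
type seccompProfile struct {
	DefaultAction string `json:"defaultAction"`
	Syscalls      []struct {
		Names  []string `json:"names"`
		Action string   `json:"action"`
	} `json:"syscalls"`
}

// missingSyscalls returns the required syscalls that the profile does not allow.
// Simplification: the last rule naming a syscall wins, and argument
// conditions are not evaluated.
func missingSyscalls(profileJSON string, required []string) ([]string, error) {
	var p seccompProfile
	if err := json.Unmarshal([]byte(profileJSON), &p); err != nil {
		return nil, fmt.Errorf("invalid profile: %w", err)
	}
	action := make(map[string]string)
	for _, rule := range p.Syscalls {
		for _, name := range rule.Names {
			action[name] = rule.Action
		}
	}
	var missing []string
	for _, name := range required {
		a, ok := action[name]
		if !ok {
			a = p.DefaultAction // no explicit rule: the default applies
		}
		if a != "SCMP_ACT_ALLOW" {
			missing = append(missing, name)
		}
	}
	return missing, nil
}

func main() {
	// Trimmed sample profile in the same shape as the one above
	profile := `{
	  "defaultAction": "SCMP_ACT_ERRNO",
	  "syscalls": [
	    {"names": ["read", "write", "openat", "close"], "action": "SCMP_ACT_ALLOW"}
	  ]
	}`
	// Hypothetical list, for example gathered by tracing the app in staging
	required := []string{"read", "write", "rt_sigprocmask"}

	missing, err := missingSyscalls(profile, required)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("syscalls the profile would block:", missing)
}
```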
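Apply in Kubernetes. The localhostProfile path is relative to the kubelet's seccomp directory (by default /var/lib/kubelet/seccomp), so the profile must exist on every node that can run the pod:
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      # Use this field rather than the deprecated
      # seccomp.security.alpha.kubernetes.io annotations
      seccompProfile:
        type: Localhost
        localhostProfile: profiles/myapp.json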
AppArmor Profiles
AppArmor provides mandatory access control:
#include <tunables/global>

profile myapp flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  # Allow network access
  network inet tcp,
  network inet udp,

  # Allow reading from /app
  /app/** r,

  # Allow writing to /tmp and /var/run
  /tmp/** rw,
  /var/run/** rw,

  # Access not allowed above is already denied by default; explicit deny
  # rules additionally override any allow rule added later
  deny /etc/shadow r,
  deny /root/** rw,
  deny /home/** rw,

  # Allow executing the application
  /usr/bin/myapp ix,
}
Load and apply:
# Load AppArmor profile
sudo apparmor_parser -r -W /etc/apparmor.d/myapp
# Verify it's loaded
sudo aa-status | grep myapp
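From inside a process, the active confinement can be read from /proc/self/attr/current, which holds a label such as "myapp (enforce)" or "unconfined". The following Go sketch parses that label so an application could log, at startup, which profile it is running under. The parseAppArmorLabel helper is an illustrative name, and the fallback sample is an assumed value used when the file is unavailable.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseAppArmorLabel splits a label like "myapp (enforce)" into the profile
// name and mode. Labels without a mode, such as "unconfined", return an
// empty mode.
func parseAppArmorLabel(label string) (profile, mode string) {
	label = strings.TrimSpace(label)
	if i := strings.LastIndex(label, " ("); i >= 0 && strings.HasSuffix(label, ")") {
		return label[:i], label[i+2 : len(label)-1]
	}
	return label, ""
}

func main() {
	// Assumed sample label, used when the proc file cannot be read
	// (for example on a system without AppArmor)
	label := "myapp (enforce)\n"
	if data, err := os.ReadFile("/proc/self/attr/current"); err == nil {
		label = string(data)
	}

	profile, mode := parseAppArmorLabel(label)
	if mode == "" {
		fmt.Printf("profile=%q (no enforcement mode reported)\n", profile)
		return
	}
	fmt.Printf("profile=%q mode=%q\n", profile, mode)
}
```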
In Kubernetes:
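apiVersion: v1
kind: Pod
metadata:
  name: apparmor-pod
  annotations:
    # Key format: container.apparmor.security.beta.kubernetes.io/<container-name>
    # Kubernetes 1.30+ also offers a securityContext.appArmorProfile field
    container.apparmor.security.beta.kubernetes.io/app: localhost/myapp
spec:
  containers:
  - name: app
    image: myapp:latest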
Pod Security Standards
Kubernetes defines three Pod Security Standards profiles:
- Privileged: Unrestricted (not recommended for application workloads)
- Baseline: Minimally restrictive; blocks known privilege escalations
- Restricted: Heavily restricted (recommended)
Enforce with Pod Security Admission:
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Example restricted pod:
apiVersion: v1
kind: Pod
metadata:
  name: restricted-pod
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      runAsUser: 1000
      capabilities:
        drop:
        - ALL
      seccompProfile:
        type: RuntimeDefault
    resources:
      limits:
        cpu: "1"
        memory: "512Mi"
      requests:
        cpu: "100m"
        memory: "128Mi"
Runtime Security with Falco
Falco detects anomalous behavior:
# Install Falco
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace
Custom rules:
# /etc/falco/rules.d/custom_rules.yaml
# The last rule references this list; populate it with the addresses to flag
- list: suspicious_ips
  items: []

- rule: Unauthorized Process in Container
  desc: Detect unexpected processes
  condition: >
    spawned_process and
    container and
    not proc.name in (node, nginx, java)
  output: >
    Unexpected process started in container
    (user=%user.name process=%proc.cmdline container=%container.name)
  priority: WARNING

- rule: Write to Non-Temp Directory
  desc: Detect writes outside temporary directories
  condition: >
    open_write and
    container and
    not fd.directory in (/tmp, /var/tmp, /var/run)
  output: >
    File write outside temp directory
    (file=%fd.name container=%container.name)
  priority: ERROR

- rule: Outbound Connection to Suspicious IP
  desc: Detect connections to blocklisted IPs
  condition: >
    outbound and
    fd.sip in (suspicious_ips)
  output: >
    Suspicious outbound connection
    (ip=%fd.rip container=%container.name)
  priority: CRITICAL
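Rules like these only help if someone acts on the alerts. When Falco is configured for JSON output, each alert is one JSON object per line with fields such as rule, priority, and output. The Go sketch below filters such a stream down to alerts at or above a chosen priority, for example to page only on ERROR and higher. The filterAlerts helper and the sample alert lines are illustrative assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// falcoAlert models the subset of Falco's JSON alert fields used here.
type falcoAlert struct {
	Rule     string `json:"rule"`
	Priority string `json:"priority"`
	Output   string `json:"output"`
}

// priorityRank orders Falco priorities from most (0) to least severe.
var priorityRank = map[string]int{
	"emergency": 0, "alert": 1, "critical": 2, "error": 3,
	"warning": 4, "notice": 5, "informational": 6, "debug": 7,
}

// filterAlerts parses newline-delimited JSON alerts and keeps those at or
// above minPriority. Alerts with an unrecognized priority are dropped.
func filterAlerts(lines, minPriority string) ([]falcoAlert, error) {
	threshold, ok := priorityRank[strings.ToLower(minPriority)]
	if !ok {
		return nil, fmt.Errorf("unknown priority %q", minPriority)
	}
	var kept []falcoAlert
	for _, line := range strings.Split(lines, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		var a falcoAlert
		if err := json.Unmarshal([]byte(line), &a); err != nil {
			return nil, fmt.Errorf("bad alert line: %w", err)
		}
		if rank, known := priorityRank[strings.ToLower(a.Priority)]; known && rank <= threshold {
			kept = append(kept, a)
		}
	}
	return kept, nil
}

func main() {
	// Hand-written sample alerts, not captured from a real Falco instance
	sample := `{"rule":"Unauthorized Process in Container","priority":"Warning","output":"Unexpected process started in container"}
{"rule":"Outbound Connection to Suspicious IP","priority":"Critical","output":"Suspicious outbound connection"}`

	alerts, err := filterAlerts(sample, "error")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, a := range alerts {
		fmt.Printf("[%s] %s: %s\n", a.Priority, a.Rule, a.Output)
	}
}
```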
Image Scanning
Scan images for vulnerabilities:
# Trivy scanning
trivy image myapp:latest
# In CI/CD pipeline
docker build -t myapp:latest .
trivy image --severity HIGH,CRITICAL --exit-code 1 myapp:latest
docker push myapp:latest
GitHub Actions example:
name: Container Security
on: [push]
permissions:
  contents: read
  security-events: write  # Required to upload SARIF results
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Build image
      run: docker build -t myapp:${{ github.sha }} .
    - name: Run Trivy
      # In real pipelines, pin to a release tag or commit SHA instead of master
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: myapp:${{ github.sha }}
        format: 'sarif'
        output: 'trivy-results.sarif'
        severity: 'CRITICAL,HIGH'
    - name: Upload to Security tab
      uses: github/codeql-action/upload-sarif@v3
      with:
        sarif_file: 'trivy-results.sarif'
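When the built-in --exit-code gate is too coarse, a pipeline can apply its own policy to Trivy's JSON report (produced with --format json). The following Go sketch counts findings by severity and decides whether to block. The helper names are illustrative, the sample report is hand-written with placeholder IDs, and only the Results, Vulnerabilities, and Severity fields of the report are assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// trivyReport models the subset of Trivy's JSON report used by this gate.
type trivyReport struct {
	Results []struct {
		Target          string `json:"Target"`
		Vulnerabilities []struct {
			VulnerabilityID string `json:"VulnerabilityID"`
			Severity        string `json:"Severity"`
		} `json:"Vulnerabilities"`
	} `json:"Results"`
}

// countBySeverity tallies findings per severity across all scan targets.
func countBySeverity(reportJSON string) (map[string]int, error) {
	var report trivyReport
	if err := json.Unmarshal([]byte(reportJSON), &report); err != nil {
		return nil, fmt.Errorf("invalid report: %w", err)
	}
	counts := make(map[string]int)
	for _, result := range report.Results {
		for _, vuln := range result.Vulnerabilities {
			counts[vuln.Severity]++
		}
	}
	return counts, nil
}

// shouldBlock reports whether any of the blocking severities has findings.
func shouldBlock(counts map[string]int, blocking ...string) bool {
	for _, severity := range blocking {
		if counts[severity] > 0 {
			return true
		}
	}
	return false
}

func main() {
	// Hand-written sample with placeholder IDs, not real scan output
	sample := `{"Results":[{"Target":"myapp:latest","Vulnerabilities":[
	  {"VulnerabilityID":"EXAMPLE-0001","Severity":"HIGH"},
	  {"VulnerabilityID":"EXAMPLE-0002","Severity":"LOW"}]}]}`

	counts, err := countBySeverity(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("findings by severity:", counts)
	fmt.Println("block the push:", shouldBlock(counts, "HIGH", "CRITICAL"))
}
```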
Network Policies
Restrict network access:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow from frontend only
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  # Allow to database
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  # Allow DNS (kubernetes.io/metadata.name is set automatically on every namespace)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
Default deny all:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
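The way these two policies combine is easy to get wrong: policies are additive allowlists, and a pod becomes isolated as soon as any policy selects it. The Go sketch below is a deliberately simplified model of ingress evaluation within one namespace, using label matching only. The types and the ingressAllowed function are illustrative and not the Kubernetes API; namespace selectors, IP blocks, and egress are left out.

```go
package main

import "fmt"

// labels is a set of pod labels or a label selector.
type labels map[string]string

// ingressRule allows traffic from pods matching From on the given port.
type ingressRule struct {
	From labels
	Port int
}

// policy selects destination pods and lists their allowed ingress.
type policy struct {
	Selector labels
	Ingress  []ingressRule
}

// matches reports whether every selector label is present on the pod.
// An empty selector matches all pods.
func matches(selector, pod labels) bool {
	for k, v := range selector {
		if pod[k] != v {
			return false
		}
	}
	return true
}

// ingressAllowed models additive evaluation: if no policy selects the
// destination, traffic is allowed; otherwise at least one rule must match.
func ingressAllowed(policies []policy, src, dst labels, port int) bool {
	selected := false
	for _, p := range policies {
		if !matches(p.Selector, dst) {
			continue
		}
		selected = true
		for _, r := range p.Ingress {
			if matches(r.From, src) && r.Port == port {
				return true
			}
		}
	}
	return !selected
}

func main() {
	policies := []policy{
		{Selector: labels{}}, // default deny: selects every pod, allows nothing
		{Selector: labels{"app": "api"}, Ingress: []ingressRule{
			{From: labels{"app": "frontend"}, Port: 8080},
		}},
	}
	api := labels{"app": "api"}
	fmt.Println("frontend -> api:8080:", ingressAllowed(policies, labels{"app": "frontend"}, api, 8080))
	fmt.Println("batch    -> api:8080:", ingressAllowed(policies, labels{"app": "batch"}, api, 8080))
	fmt.Println("frontend -> api:9090:", ingressAllowed(policies, labels{"app": "frontend"}, api, 9090))
}
```

In this model, the default-deny policy never needs an exception list: adding a more specific policy is what opens a path.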
Secret Management
Never bake secrets into images:
# DON'T DO THIS
FROM ubuntu:20.04
ENV API_KEY=sk_live_secret123
# app.conf contains passwords
COPY app.conf /etc/app.conf
Use Kubernetes secrets:
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
  namespace: production
type: Opaque
data:
  api-key: c2tfc2VjcmV0MTIz  # base64 encoded, which is an encoding, not encryption
---
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: production  # Must match the Secret's namespace
spec:
  containers:
  - name: app
    image: myapp:latest
    env:
    - name: API_KEY
      valueFrom:
        secretKeyRef:
          name: app-secrets
          key: api-key
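Two points about this manifest are worth making concrete: the base64 value in the Secret can be decoded by anyone who can read it, and the application should fail fast when the secret is absent. The Go sketch below shows both. The helper names and the /etc/secrets/api-key fallback path (for a Secret mounted as a volume) are assumptions for illustration.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"os"
	"strings"
)

// decodeSecretValue shows that a Secret's data field is only base64 encoded:
// no key is needed to recover the plaintext.
func decodeSecretValue(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", fmt.Errorf("invalid base64: %w", err)
	}
	return string(raw), nil
}

// loadSecret returns the value of the environment variable envName, falling
// back to the file at filePath. It returns an error when neither is available.
func loadSecret(envName, filePath string) (string, error) {
	if v := os.Getenv(envName); v != "" {
		return v, nil
	}
	data, err := os.ReadFile(filePath)
	if err != nil {
		return "", fmt.Errorf("%s is unset and %s is unreadable: %w", envName, filePath, err)
	}
	v := strings.TrimSpace(string(data))
	if v == "" {
		return "", errors.New("secret file is empty")
	}
	return v, nil
}

func main() {
	plain, err := decodeSecretValue("c2tfc2VjcmV0MTIz")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("decoded without any key:", plain) // prints sk_secret123

	// The file path is a hypothetical mount point for the Secret
	if _, err := loadSecret("API_KEY", "/etc/secrets/api-key"); err != nil {
		fmt.Println("startup would fail:", err)
	} else {
		fmt.Println("API key loaded") // never log the value itself
	}
}
```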
Better: Use external secret management:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: app-secrets
  data:
  - secretKey: api-key
    remoteRef:
      key: secret/data/production/api-key
Resource Limits
Prevent resource exhaustion:
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "1000m"
Enforce with LimitRange:
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-cpu-limit-range
  namespace: production
spec:
  limits:
  - max:
      cpu: "2"
      memory: "2Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container
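Inside the container, a memory limit surfaces through the cgroup filesystem. On cgroup v2 hosts, /sys/fs/cgroup/memory.max holds either the literal "max" (no limit) or a byte count. The Go sketch below parses that file so an application can size caches to its limit or warn when none is set. The parseMemoryMax helper is an illustrative name, the fallback sample is an assumed value, and cgroup v1 hosts use a different path that is not handled here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryMax parses the contents of a cgroup v2 memory.max file.
// It returns limited=false when the value is "max" (no limit set).
func parseMemoryMax(content string) (limit uint64, limited bool, err error) {
	s := strings.TrimSpace(content)
	if s == "max" {
		return 0, false, nil
	}
	limit, err = strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("unexpected memory.max value %q: %w", s, err)
	}
	return limit, true, nil
}

func main() {
	// Assumed sample (512Mi in bytes), used when the file is unavailable,
	// for example on cgroup v1 or non-Linux hosts
	content := "536870912\n"
	if data, err := os.ReadFile("/sys/fs/cgroup/memory.max"); err == nil {
		content = string(data)
	}

	limit, limited, err := parseMemoryMax(content)
	switch {
	case err != nil:
		fmt.Println(err)
	case !limited:
		fmt.Println("warning: no memory limit set for this container")
	default:
		fmt.Printf("memory limit: %d bytes (%d MiB)\n", limit, limit/(1024*1024))
	}
}
```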
Admission Controllers
Enforce policies at admission time:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: security-validation
webhooks:
- name: validate.security.example.com
  clientConfig:
    service:
      name: security-validator
      namespace: security
      path: /validate
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  admissionReviewVersions: ["v1"]
  sideEffects: None
Webhook server:
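package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Validator holds the pod validation logic for the webhook.
type Validator struct{}

func (v *Validator) validatePod(pod *corev1.Pod) error {
	// Check init containers as well as regular containers
	containers := append([]corev1.Container{}, pod.Spec.InitContainers...)
	containers = append(containers, pod.Spec.Containers...)

	for _, container := range containers {
		sc := container.SecurityContext

		// Reject privileged containers
		if sc != nil && sc.Privileged != nil && *sc.Privileged {
			return fmt.Errorf("container %q: privileged containers not allowed", container.Name)
		}

		// Require non-root (checked per container; pod-level settings are not consulted)
		if sc == nil || sc.RunAsNonRoot == nil || !*sc.RunAsNonRoot {
			return fmt.Errorf("container %q: containers must run as non-root", container.Name)
		}

		// Check image registry
		if !strings.HasPrefix(container.Image, "registry.example.com/") {
			return fmt.Errorf("container %q: images must come from approved registry", container.Name)
		}
	}
	return nil
}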
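The validation function still has to be wired to the AdmissionReview request and response envelope that the API server exchanges with the webhook. The Go sketch below shows that envelope using small local stand-in structs instead of the k8s.io/api types, so it runs with the standard library alone. It applies only the registry check, all type and function names are illustrative, and a real webhook must serve this handler over TLS.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Minimal stand-ins for the admission.k8s.io/v1 AdmissionReview types.
type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

type admissionRequest struct {
	UID    string          `json:"uid"`
	Object json.RawMessage `json:"object"`
}

type admissionResponse struct {
	UID     string  `json:"uid"`
	Allowed bool    `json:"allowed"`
	Status  *status `json:"status,omitempty"`
}

type status struct {
	Message string `json:"message"`
}

// pod models only the fields this check reads.
type pod struct {
	Spec struct {
		Containers []struct {
			Image string `json:"image"`
		} `json:"containers"`
	} `json:"spec"`
}

// handleValidate decodes the review, checks image registries, and replies
// with a response that echoes the request UID, as the API server requires.
func handleValidate(w http.ResponseWriter, r *http.Request) {
	var review admissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}
	resp := &admissionResponse{UID: review.Request.UID, Allowed: true}
	var p pod
	if err := json.Unmarshal(review.Request.Object, &p); err != nil {
		resp.Allowed = false
		resp.Status = &status{Message: "cannot decode pod"}
	}
	for _, c := range p.Spec.Containers {
		if !strings.HasPrefix(c.Image, "registry.example.com/") {
			resp.Allowed = false
			resp.Status = &status{Message: "images must come from approved registry"}
			break
		}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(admissionReview{
		APIVersion: "admission.k8s.io/v1", Kind: "AdmissionReview", Response: resp,
	})
}

// reviewImage sends a one-container pod through the handler in-process and
// returns the verdict and any rejection message.
func reviewImage(image string) (bool, string) {
	body := fmt.Sprintf(`{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview",`+
		`"request":{"uid":"demo-uid","object":{"spec":{"containers":[{"image":%q}]}}}}`, image)
	rec := httptest.NewRecorder()
	handleValidate(rec, httptest.NewRequest(http.MethodPost, "/validate", strings.NewReader(body)))

	var out admissionReview
	if err := json.Unmarshal(rec.Body.Bytes(), &out); err != nil || out.Response == nil {
		return false, "no response"
	}
	msg := ""
	if out.Response.Status != nil {
		msg = out.Response.Status.Message
	}
	return out.Response.Allowed, msg
}

func main() {
	for _, image := range []string{"registry.example.com/myapp:1.0", "docker.io/library/nginx:latest"} {
		allowed, msg := reviewImage(image)
		fmt.Printf("%s -> allowed=%v %s\n", image, allowed, msg)
	}
}
```

In a deployed webhook, handleValidate would be registered on the /validate path of an HTTPS server whose certificate the API server trusts.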
Security Best Practices Checklist
- Run as non-root user
- Use read-only root filesystem
- Drop all capabilities, add only needed ones
- Enable seccomp profiles
- Use AppArmor or SELinux
- Scan images for vulnerabilities
- Don't include secrets in images
- Set resource limits
- Use network policies
- Enable Pod Security Standards
- Implement admission controllers
- Use runtime security (Falco)
- Keep base images minimal
- Regularly update images
- Monitor and audit container activity
Conclusion
Container security requires defense in depth. No single measure is sufficient; you need multiple layers:
- Build time: Scan images, minimize base images, no secrets
- Deploy time: Enforce security policies with admission controllers
- Runtime: Monitor with Falco, enforce network policies, use seccomp/AppArmor
- Always: Run as non-root, drop capabilities, use read-only filesystems
Start with the basics (non-root, capabilities, resource limits) and progressively add more sophisticated controls (seccomp, AppArmor, runtime security). The goal is to make exploitation difficult even if an attacker compromises a container.
Security is not a checkbox; it's a continuous process of hardening, monitoring, and responding to threats.