Container Runtime Security: Protecting the Foundation

Containers revolutionized application deployment, but they introduced new security challenges. Unlike virtual machines with hardware-level isolation, containers share the host kernel, creating a larger attack surface. After securing production container environments running thousands of workloads, I’ve learned which security practices matter most and how to implement them effectively.

Understanding Container Security Layers

Container security is multi-layered:

┌─────────────────────────────────────┐
│     Application Code & Dependencies │  ← Application Security
├─────────────────────────────────────┤
│        Container Image              │  ← Image Security
├─────────────────────────────────────┤
│       Container Runtime             │  ← Runtime Security
├─────────────────────────────────────┤
│    Host Operating System            │  ← Host Security
├─────────────────────────────────────┤
│     Orchestration Platform          │  ← Kubernetes Security
└─────────────────────────────────────┘

This post focuses on the runtime layer—how containers execute and how to secure that execution.

Linux Namespaces: The Foundation

Containers use Linux namespaces for isolation. Understanding them is crucial:

PID Namespace: Process isolation

# See processes in container
docker exec mycontainer ps aux

# From host, see real PIDs
ps aux | grep container-process

Network Namespace: Network stack isolation

# Container has its own network stack
docker exec mycontainer ip addr

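# From the host, list every network namespace.
# (ip netns list only shows namespaces registered under /var/run/netns,
# which Docker does not do by default.)
sudo lsns -t net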

Mount Namespace: Filesystem isolation

# Container sees only its filesystem
docker exec mycontainer df -h

# Mount points don't affect host

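User Namespace: User ID isolation (most critical for security)

# --user changes the UID inside the container; it does not create a user namespace.
# Without user namespace remapping, UID 0 in the container is still UID 0 on the host.
docker run --user 1000:1000 myimage id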
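One way to see these namespaces directly is to read the symlinks under /proc/<pid>/ns, where each link target has the form type:[inode]. The Go sketch below is illustrative only; the helper names parseNSLink and listNamespaces are my own, not part of any Docker tooling. Two processes share a namespace when the inode numbers match, so comparing a container process against PID 1 on the host shows which namespaces are actually isolated.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNSLink splits a /proc/<pid>/ns link target such as "pid:[4026531836]"
// into the namespace type and its inode number.
func parseNSLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad inode in %q: %w", target, err)
	}
	return kind, inode, nil
}

// listNamespaces returns entry name -> namespace inode for the given PID
// ("self" works too). It only works on Linux, where /proc/<pid>/ns exists.
func listNamespaces(pid string) (map[string]uint64, error) {
	dir := filepath.Join("/proc", pid, "ns")
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	result := make(map[string]uint64)
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, err
		}
		_, inode, err := parseNSLink(target)
		if err != nil {
			return nil, err
		}
		result[e.Name()] = inode
	}
	return result, nil
}

func main() {
	ns, err := listNamespaces("self")
	if err != nil {
		fmt.Println("could not read namespaces:", err)
		return
	}
	for name, inode := range ns {
		fmt.Printf("%-18s %d\n", name, inode)
	}
}
```

Reading another process's namespaces usually requires root, so run it with sudo when passing a container's host PID.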

The User Namespace Problem

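By default, Docker does not use a user namespace, so UID 0 inside a container is UID 0 on the host. Capabilities and seccomp limit what that root user can do, but a container escape still lands as host root: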

# Dockerfile
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y vim
USER root
CMD ["/bin/bash"]

# This container runs as root
docker run -it myimage

# If container breaks out, has root on host

Solution: Use user namespaces

Enable user namespace remapping:

// /etc/docker/daemon.json
{
  "userns-remap": "default"
}

sudo systemctl restart docker

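# Container root now maps to an unprivileged UID on the host
docker run -d --name userns-test ubuntu sleep 300
docker exec userns-test whoami   # Shows 'root'
ps -o user,pid,cmd -C sleep      # Shows a subordinate UID on the host (often 100000, per /etc/subuid)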
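The remapping itself is simple arithmetic over the ranges in /proc/<pid>/uid_map, where each line holds a container start ID, a host start ID, and a length. Here is a small Go sketch of that translation. The 100000 base in the example is an assumption (the real value comes from /etc/subuid on your host), and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idMapping is one line of a uid_map file: Length IDs starting at
// ContainerStart inside the namespace map to the IDs starting at HostStart.
type idMapping struct {
	ContainerStart, HostStart, Length uint32
}

// parseUIDMap parses the contents of a uid_map file.
func parseUIDMap(content string) ([]idMapping, error) {
	var maps []idMapping
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("bad uid_map line %q", line)
		}
		var nums [3]uint32
		for i, f := range fields {
			n, err := strconv.ParseUint(f, 10, 32)
			if err != nil {
				return nil, fmt.Errorf("bad number %q: %w", f, err)
			}
			nums[i] = uint32(n)
		}
		maps = append(maps, idMapping{nums[0], nums[1], nums[2]})
	}
	return maps, nil
}

// hostUID translates a container UID to the host UID.
// The second result is false when the UID has no mapping.
func hostUID(maps []idMapping, containerUID uint32) (uint32, bool) {
	for _, m := range maps {
		if containerUID >= m.ContainerStart && containerUID-m.ContainerStart < m.Length {
			return m.HostStart + (containerUID - m.ContainerStart), true
		}
	}
	return 0, false
}

func main() {
	// Example contents; the 100000 base is an assumed subordinate range.
	maps, err := parseUIDMap("         0     100000      65536\n")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, uid := range []uint32{0, 1000, 70000} {
		if h, ok := hostUID(maps, uid); ok {
			fmt.Printf("container uid %d -> host uid %d\n", uid, h)
		} else {
			fmt.Printf("container uid %d is unmapped\n", uid)
		}
	}
}
```

An unmapped UID is the useful case to notice: files owned by it show up as the overflow user (usually nobody) inside the container.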

Better: Run as non-root explicitly

FROM ubuntu:20.04

# Create app user
RUN groupadd -r appuser && useradd -r -g appuser appuser

# Install dependencies as root
RUN apt-get update && apt-get install -y myapp

# Switch to non-root
USER appuser

CMD ["/usr/bin/myapp"]

Capabilities: Fine-Grained Privileges

Linux capabilities split root privileges into units. Drop unnecessary ones:

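# Docker's default set drops many capabilities but still keeps about 14
docker run --rm -it ubuntu grep Cap /proc/self/status

# Drop all, add back only what's needed.
# The official nginx image starts as root and then switches to the nginx user,
# so it needs CHOWN, SETGID, and SETUID alongside NET_BIND_SERVICE.
docker run --rm -it \
  --cap-drop=ALL \
  --cap-add=CHOWN \
  --cap-add=SETGID \
  --cap-add=SETUID \
  --cap-add=NET_BIND_SERVICE \
  nginx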
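The Cap lines in /proc/self/status are hexadecimal bitmasks, which are hard to read by eye. A short Go sketch like the following decodes one into capability names, using the bit numbers from the kernel's capability header. The mask a80425fb is what a default Docker container typically reports for CapEff; treat it as an example value, since defaults can differ between versions. The function name decodeCaps is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// capNames lists Linux capabilities by bit number (0 = CHOWN ... 40 = CHECKPOINT_RESTORE).
var capNames = []string{
	"CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID",
	"KILL", "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE",
	"NET_BIND_SERVICE", "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK",
	"IPC_OWNER", "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE",
	"SYS_PACCT", "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE",
	"SYS_TIME", "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE",
	"AUDIT_CONTROL", "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG",
	"WAKE_ALARM", "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF",
	"CHECKPOINT_RESTORE",
}

// decodeCaps turns a hex mask such as "00000000a80425fb" into capability names.
func decodeCaps(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, fmt.Errorf("invalid capability mask %q: %w", hexMask, err)
	}
	var names []string
	for bit := 0; bit < 64; bit++ {
		if mask&(uint64(1)<<uint(bit)) == 0 {
			continue
		}
		if bit < len(capNames) {
			names = append(names, capNames[bit])
		} else {
			// Newer kernels may define bits this table does not know about.
			names = append(names, fmt.Sprintf("CAP_%d", bit))
		}
	}
	return names, nil
}

func main() {
	names, err := decodeCaps("00000000a80425fb")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%d capabilities:\n", len(names))
	for _, n := range names {
		fmt.Println("  " + n)
	}
}
```

If capsh is installed, capsh --decode=<mask> gives the same answer; the sketch is mainly useful inside minimal images that lack it.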

In Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE
        readOnlyRootFilesystem: true

Read-Only Root Filesystem

Prevent runtime modifications:

apiVersion: v1
kind: Pod
metadata:
  name: readonly-pod
spec:
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: var-run
          mountPath: /var/run
  volumes:
    - name: tmp
      emptyDir: {}
    - name: var-run
      emptyDir: {}

Application code:

package main

import (
    "log"
    "os"
)

func main() {
    // This fails with a read-only root filesystem:
    // err := os.WriteFile("/app/data.txt", []byte("test"), 0644)

    // This works: /tmp is backed by the emptyDir volume
    if err := os.WriteFile("/tmp/data.txt", []byte("test"), 0644); err != nil {
        log.Fatal(err)
    }

    // For caching, use /tmp or another emptyDir volume
    cacheDir := os.Getenv("CACHE_DIR")
    if cacheDir == "" {
        cacheDir = "/tmp/cache"
    }
    if err := os.MkdirAll(cacheDir, 0755); err != nil {
        log.Fatal(err)
    }
}

Seccomp Profiles

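Seccomp (Secure Computing Mode) restricts which system calls a container can make. The profile below is an illustrative allowlist; a real application usually needs more syscalls than this, so build the list by tracing the actual workload: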

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "accept",
        "accept4",
        "access",
        "arch_prctl",
        "bind",
        "brk",
        "clone",
        "close",
        "connect",
        "dup",
        "dup2",
        "epoll_create",
        "epoll_ctl",
        "epoll_wait",
        "execve",
        "exit",
        "exit_group",
        "fstat",
        "futex",
        "getcwd",
        "getpid",
        "getppid",
        "listen",
        "mmap",
        "mprotect",
        "munmap",
        "open",
        "openat",
        "read",
        "rt_sigaction",
        "rt_sigreturn",
        "socket",
        "stat",
        "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
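Maintaining that JSON by hand gets tedious once the list grows. As a sketch of one approach, the Go program below builds the same deny-by-default structure from a list of observed syscall names (for example, collected with strace or an audit-mode profile). The struct and function names are my own; only the JSON field names and action constants come from the profile format shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// syscallRule and seccompProfile mirror the fields used in the profile above.
type syscallRule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

type seccompProfile struct {
	DefaultAction string        `json:"defaultAction"`
	Architectures []string      `json:"architectures"`
	Syscalls      []syscallRule `json:"syscalls"`
}

// buildAllowlist returns a profile that denies everything except the
// observed syscalls. Duplicates and empty names are dropped, and the
// result is sorted so diffs between profile versions stay readable.
func buildAllowlist(observed []string) seccompProfile {
	seen := make(map[string]bool)
	var names []string
	for _, s := range observed {
		if s != "" && !seen[s] {
			seen[s] = true
			names = append(names, s)
		}
	}
	sort.Strings(names)
	return seccompProfile{
		DefaultAction: "SCMP_ACT_ERRNO",
		Architectures: []string{"SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"},
		Syscalls:      []syscallRule{{Names: names, Action: "SCMP_ACT_ALLOW"}},
	}
}

func main() {
	// Hypothetical trace output; a real list would be much longer.
	observed := []string{"read", "write", "openat", "read", "close", "exit_group"}
	out, err := json.MarshalIndent(buildAllowlist(observed), "", "  ")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(out))
}
```

Test a generated profile in a staging environment first: a missing syscall usually shows up as an obscure "operation not permitted" error at runtime.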

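Apply in Kubernetes (localhostProfile is a path relative to the kubelet's seccomp directory, /var/lib/kubelet/seccomp by default):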

apiVersion: v1
kind: Pod
metadata:
  name: seccomp-pod
spec:
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/myapp.json

AppArmor Profiles

AppArmor provides mandatory access control:

#include <tunables/global>

profile myapp flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  # Allow network access
  network inet tcp,
  network inet udp,

  # Allow reading from /app
  /app/** r,

  # Allow writing to /tmp and /var/run
  /tmp/** rw,
  /var/run/** rw,

  # Explicitly deny sensitive paths
  # (anything not allowed above is already denied in enforce mode)
  deny /etc/shadow r,
  deny /root/** rw,
  deny /home/** rw,

  # Allow executing the application
  /usr/bin/myapp ix,
}

Load and apply:

# Load AppArmor profile
sudo apparmor_parser -r -W /etc/apparmor.d/myapp

# Verify it's loaded
sudo aa-status | grep myapp
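It also helps to confirm from inside the workload that the process is really confined. On Linux, /proc/self/attr/current reports the active profile as either "unconfined" or "name (mode)". The following Go sketch parses that value; parseConfinement is a hypothetical helper name, and the check is something an application could log at startup.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseConfinement splits a value such as "myapp (enforce)" into the
// profile name and mode. For "unconfined" the mode is empty.
func parseConfinement(raw string) (profile, mode string) {
	s := strings.TrimSpace(strings.TrimRight(raw, "\x00"))
	if i := strings.LastIndex(s, " ("); i >= 0 && strings.HasSuffix(s, ")") {
		return s[:i], s[i+2 : len(s)-1]
	}
	return s, ""
}

func main() {
	raw, err := os.ReadFile("/proc/self/attr/current")
	if err != nil {
		// Happens on hosts without AppArmor or on non-Linux systems.
		fmt.Println("AppArmor status unavailable:", err)
		return
	}
	profile, mode := parseConfinement(string(raw))
	if mode != "enforce" {
		fmt.Printf("warning: running with profile %q in mode %q\n", profile, mode)
		return
	}
	fmt.Printf("confined by %q (enforce)\n", profile)
}
```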

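In Kubernetes (the annotation below applies to clusters older than 1.30; newer releases use the securityContext.appArmorProfile field instead):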

apiVersion: v1
kind: Pod
metadata:
  name: apparmor-pod
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: localhost/myapp
spec:
  containers:
    - name: app
      image: myapp:latest

Pod Security Standards

Kubernetes defines three Pod Security Standards:

Privileged: Unrestricted (not recommended)

Baseline: Minimally restrictive

Restricted: Heavily restricted (recommended)

Enforce with Pod Security Admission:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Example restricted pod:

apiVersion: v1
kind: Pod
metadata:
  name: restricted-pod
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        runAsUser: 1000
        capabilities:
          drop:
            - ALL
        seccompProfile:
          type: RuntimeDefault
      resources:
        limits:
          cpu: "1"
          memory: "512Mi"
        requests:
          cpu: "100m"
          memory: "128Mi"

Runtime Security with Falco

Falco detects anomalous behavior:

# Install Falco
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace

Custom rules:

# /etc/falco/rules.d/custom_rules.yaml
- rule: Unauthorized Process in Container
  desc: Detect unexpected processes
  condition: >
    spawned_process and
    container and
    not proc.name in (node, nginx, java)
  output: >
    Unexpected process started in container
    (user=%user.name process=%proc.cmdline container=%container.name)
  priority: WARNING

- rule: Write to Non-Temp Directory
  desc: Detect writes outside /tmp
  condition: >
    open_write and
    container and
    not fd.directory in (/tmp, /var/tmp, /var/run)
  output: >
    File write outside temp directory
    (file=%fd.name container=%container.name)
  priority: ERROR

# Assumes a list named suspicious_ips is defined elsewhere in the rules files
- rule: Outbound Connection to Suspicious IP
  desc: Detect connections to blacklisted IPs
  condition: >
    outbound and
    fd.sip in (suspicious_ips)
  output: >
    Suspicious outbound connection
    (ip=%fd.rip container=%container.name)
  priority: CRITICAL
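Detection only helps if someone acts on the alerts. When Falco's JSON output is enabled, each alert is a JSON object with fields such as rule, priority, output, and output_fields. The Go sketch below shows one way to filter alerts by priority before forwarding them; the sample alert is made up, and the function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// falcoAlert holds the fields of a Falco JSON alert used in this sketch.
type falcoAlert struct {
	Rule         string         `json:"rule"`
	Priority     string         `json:"priority"`
	Output       string         `json:"output"`
	OutputFields map[string]any `json:"output_fields"`
}

// severity ranks Falco priorities; lower numbers are more severe.
var severity = map[string]int{
	"emergency": 0, "alert": 1, "critical": 2, "error": 3,
	"warning": 4, "notice": 5, "informational": 6, "debug": 7,
}

// parseAlert decodes one JSON alert line.
func parseAlert(line []byte) (falcoAlert, error) {
	var a falcoAlert
	err := json.Unmarshal(line, &a)
	return a, err
}

// atLeast reports whether the alert is at least as severe as threshold.
// Unknown priorities never match.
func atLeast(a falcoAlert, threshold string) bool {
	got, okGot := severity[strings.ToLower(a.Priority)]
	want, okWant := severity[strings.ToLower(threshold)]
	return okGot && okWant && got <= want
}

func main() {
	// Made-up sample in the shape Falco emits with JSON output enabled.
	line := []byte(`{"rule":"Write to Non-Temp Directory","priority":"Error",` +
		`"output":"File write outside temp directory (file=/etc/passwd container=web)",` +
		`"output_fields":{"container.name":"web","fd.name":"/etc/passwd"}}`)

	alert, err := parseAlert(line)
	if err != nil {
		fmt.Println("skipping malformed alert:", err)
		return
	}
	if atLeast(alert, "error") {
		fmt.Printf("PAGE: [%s] %s\n", alert.Rule, alert.Output)
	} else {
		fmt.Printf("log only: [%s]\n", alert.Rule)
	}
}
```

Tuning the threshold matters more than the code: paging on WARNING for a noisy rule is the fastest way to get alerts ignored.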

Image Scanning

Scan images for vulnerabilities:

# Trivy scanning
trivy image myapp:latest

# In CI/CD pipeline
docker build -t myapp:latest .
trivy image --severity HIGH,CRITICAL --exit-code 1 myapp:latest
docker push myapp:latest
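The --exit-code flag is enough for a simple gate. For custom policies (say, different thresholds per team), you can have Trivy write a JSON report and evaluate it yourself. Here is a minimal Go sketch that counts findings by severity. It only reads the Results, Vulnerabilities, and Severity fields of the report; the sample data and CVE identifiers are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// trivyReport covers only the parts of Trivy's JSON report used here.
type trivyReport struct {
	Results []struct {
		Target          string `json:"Target"`
		Vulnerabilities []struct {
			VulnerabilityID string `json:"VulnerabilityID"`
			PkgName         string `json:"PkgName"`
			Severity        string `json:"Severity"`
		} `json:"Vulnerabilities"`
	} `json:"Results"`
}

// sampleReport is invented data in the shape of a Trivy JSON report.
const sampleReport = `{"Results":[{"Target":"myapp:latest","Vulnerabilities":[
  {"VulnerabilityID":"CVE-0000-0001","PkgName":"libexample","Severity":"HIGH"},
  {"VulnerabilityID":"CVE-0000-0002","PkgName":"libexample","Severity":"LOW"},
  {"VulnerabilityID":"CVE-0000-0003","PkgName":"libother","Severity":"CRITICAL"}]}]}`

// countBySeverity tallies findings per severity across all scan targets.
func countBySeverity(raw []byte) (map[string]int, error) {
	var report trivyReport
	if err := json.Unmarshal(raw, &report); err != nil {
		return nil, fmt.Errorf("parsing report: %w", err)
	}
	counts := make(map[string]int)
	for _, result := range report.Results {
		for _, v := range result.Vulnerabilities {
			counts[v.Severity]++
		}
	}
	return counts, nil
}

// shouldBlock applies the same policy as the pipeline above:
// any HIGH or CRITICAL finding blocks the push.
func shouldBlock(counts map[string]int) bool {
	return counts["HIGH"]+counts["CRITICAL"] > 0
}

func main() {
	counts, err := countBySeverity([]byte(sampleReport))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("findings:", counts)
	// A real CI gate would call os.Exit(1) here instead of printing.
	fmt.Println("block push:", shouldBlock(counts))
}
```

In a pipeline, the input would come from a file produced with trivy image --format json --output report.json myapp:latest rather than an embedded sample.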

GitHub Actions example:

name: Container Security
on: [push]

jobs:
  scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # required to upload SARIF results
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .

      - name: Run Trivy
        # @master is a moving target; pin to a release tag or commit SHA in production
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'

      - name: Upload to Security tab
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'

Network Policies

Restrict network access:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow from frontend only
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Allow to database
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
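    # Allow DNS (kubernetes.io/metadata.name is set automatically on every namespace)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53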

Default deny all:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
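Network policies fail silently when the CNI plugin does not enforce them, so I like to verify them with an actual connection attempt from inside a pod. The Go sketch below is a tiny TCP probe for that purpose. The function name canConnect is hypothetical, and the targets you pass depend on your own services (for example database:5432 should succeed from the api pod, while anything else should time out).

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// canConnect reports whether a TCP connection to addr (host:port)
// succeeds within the timeout. A blocked egress path typically shows
// up as a timeout rather than an immediate refusal.
func canConnect(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe host:port [host:port ...]")
		return
	}
	for _, target := range os.Args[1:] {
		status := "BLOCKED or unreachable"
		if canConnect(target, 2*time.Second) {
			status = "reachable"
		}
		fmt.Printf("%-30s %s\n", target, status)
	}
}
```

Run it from a pod that matches the policy's podSelector; a probe from any other pod tests a different rule set.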

Secret Management

Never bake secrets into images:

# DON'T DO THIS
FROM ubuntu:20.04
ENV API_KEY=sk_live_secret123
# app.conf contains passwords
COPY app.conf /etc/app.conf

Use Kubernetes secrets:

apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
  namespace: production
type: Opaque
data:
  api-key: c2tfc2VjcmV0MTIz  # base64 encoded, not encrypted
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: myapp:latest
      env:
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: api-key
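On the application side, Kubernetes can also mount a Secret as files in a volume, which keeps the value out of the process environment. A loader that prefers a mounted file and falls back to the environment variable lets the same image work with either approach. The Go sketch below assumes a hypothetical mount path of /var/run/secrets/app/api-key; the function names are my own.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// pickSecret chooses between a mounted file and an environment value.
// The file wins when it was read successfully and is not empty.
func pickSecret(fileContent string, fileOK bool, envValue string) (string, error) {
	if fileOK {
		if v := strings.TrimSpace(fileContent); v != "" {
			return v, nil
		}
	}
	if envValue != "" {
		return envValue, nil
	}
	return "", errors.New("secret not found in file or environment")
}

// loadSecret reads the secret from filePath if possible,
// otherwise from the environment variable envName.
func loadSecret(envName, filePath string) (string, error) {
	raw, err := os.ReadFile(filePath)
	return pickSecret(string(raw), err == nil, os.Getenv(envName))
}

func main() {
	// The path is an assumed mount point, not something Kubernetes sets for you.
	key, err := loadSecret("API_KEY", "/var/run/secrets/app/api-key")
	if err != nil {
		fmt.Println("startup failed:", err)
		return
	}
	// Never log the secret itself; the length is enough to confirm it loaded.
	fmt.Printf("loaded API key (%d characters)\n", len(key))
}
```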

Better: Use external secret management:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: app-secrets
  data:
    - secretKey: api-key
      remoteRef:
        key: secret/data/production/api-key

Resource Limits

Prevent resource exhaustion:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: myapp:latest
      resources:
        requests:
          memory: "128Mi"
          cpu: "100m"
        limits:
          memory: "512Mi"
          cpu: "1000m"

Enforce with LimitRange:

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-cpu-limit-range
  namespace: production
spec:
  limits:
    - max:
        cpu: "2"
        memory: "2Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container
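Applications can read the limit they were given and size caches or worker pools accordingly, instead of discovering it through an OOM kill. On hosts using cgroup v2, the memory limit appears in /sys/fs/cgroup/memory.max as either a byte count or the literal string max. The following Go sketch parses that file; it assumes cgroup v2 (cgroup v1 uses a different path), and parseMemoryMax is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryMax interprets the contents of a cgroup v2 memory.max file.
// limited is false when the file contains "max" (no limit set).
func parseMemoryMax(raw string) (bytes uint64, limited bool, err error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, false, nil
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("unexpected memory.max value %q: %w", s, err)
	}
	return n, true, nil
}

func main() {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		// Typical on cgroup v1 hosts and non-Linux systems.
		fmt.Println("memory limit unavailable:", err)
		return
	}
	limit, limited, err := parseMemoryMax(string(raw))
	if err != nil {
		fmt.Println(err)
		return
	}
	if !limited {
		fmt.Println("no memory limit set; the container can consume host memory")
		return
	}
	fmt.Printf("memory limit: %d MiB\n", limit/(1024*1024))
}
```

With the 512Mi limit from the pod spec above, the file would contain 536870912.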

Admission Controllers

Enforce policies at admission time:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: security-validation
webhooks:
  - name: validate.security.example.com
    clientConfig:
      service:
        name: security-validator
        namespace: security
        path: /validate
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    admissionReviewVersions: ["v1"]
    sideEffects: None

Webhook server:

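package webhook

import (
    "fmt"
    "strings"

    corev1 "k8s.io/api/core/v1"
)

// Validator applies the admission policy to incoming pods.
type Validator struct{}

func (v *Validator) validatePod(pod *corev1.Pod) error {
    // Check init containers too, so they cannot bypass the policy
    containers := append([]corev1.Container{}, pod.Spec.InitContainers...)
    containers = append(containers, pod.Spec.Containers...)

    for _, container := range containers {
        sc := container.SecurityContext

        // Reject privileged containers
        if sc != nil && sc.Privileged != nil && *sc.Privileged {
            return fmt.Errorf("container %q: privileged containers not allowed", container.Name)
        }

        // Require non-root at the container level
        // (a pod-level runAsNonRoot alone does not satisfy this check)
        if sc == nil || sc.RunAsNonRoot == nil || !*sc.RunAsNonRoot {
            return fmt.Errorf("container %q: containers must run as non-root", container.Name)
        }

        // Check image registry
        if !strings.HasPrefix(container.Image, "registry.example.com/") {
            return fmt.Errorf("container %q: images must come from approved registry", container.Name)
        }
    }

    return nil
}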
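The validation function still needs an HTTP endpoint that speaks the AdmissionReview protocol: the API server POSTs a review containing the object, and the webhook must echo the request UID back with an allowed flag. The Go sketch below shows that envelope using only the standard library, with hand-written structs covering just the fields it touches. It is a simplified stand-in that checks the registry rule only, not the full validator; a real webhook would use the k8s.io/api types, and the certificate paths are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strings"
)

// Minimal subset of the admission.k8s.io/v1 AdmissionReview schema.
type admissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *admissionRequest  `json:"request,omitempty"`
	Response   *admissionResponse `json:"response,omitempty"`
}

type admissionRequest struct {
	UID    string          `json:"uid"`
	Object json.RawMessage `json:"object"`
}

type admissionResponse struct {
	UID     string  `json:"uid"`
	Allowed bool    `json:"allowed"`
	Status  *status `json:"status,omitempty"`
}

type status struct {
	Message string `json:"message"`
}

// pod captures only the fields this sketch inspects.
type pod struct {
	Spec struct {
		Containers []struct {
			Name  string `json:"name"`
			Image string `json:"image"`
		} `json:"containers"`
	} `json:"spec"`
}

// review builds the response for one request: deny if any container
// image comes from outside the approved registry.
func review(in admissionReview) admissionReview {
	resp := &admissionResponse{UID: in.Request.UID, Allowed: true}
	var p pod
	if err := json.Unmarshal(in.Request.Object, &p); err != nil {
		resp.Allowed = false
		resp.Status = &status{Message: "could not decode pod: " + err.Error()}
	}
	for _, c := range p.Spec.Containers {
		if !strings.HasPrefix(c.Image, "registry.example.com/") {
			resp.Allowed = false
			resp.Status = &status{Message: fmt.Sprintf(
				"container %q: image %q is not from the approved registry", c.Name, c.Image)}
			break
		}
	}
	return admissionReview{APIVersion: "admission.k8s.io/v1", Kind: "AdmissionReview", Response: resp}
}

func handleValidate(w http.ResponseWriter, r *http.Request) {
	var in admissionReview
	if err := json.NewDecoder(r.Body).Decode(&in); err != nil || in.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(review(in)); err != nil {
		log.Println("writing response:", err)
	}
}

func main() {
	http.HandleFunc("/validate", handleValidate)
	// The API server only calls webhooks over HTTPS; these paths are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil))
}
```

Keep the handler fast and decide deliberately how it should fail: the webhook configuration's failurePolicy determines whether pods are admitted or rejected when this service is down.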

Security Best Practices Checklist

Conclusion

Container security requires defense in depth. No single measure is sufficient—you need multiple layers:

Build time: Scan images, minimize base images, no secrets
Deploy time: Enforce security policies with admission controllers
Runtime: Monitor with Falco, enforce network policies, use seccomp/AppArmor
Always: Run as non-root, drop capabilities, use read-only filesystems

Start with the basics (non-root, capabilities, resource limits) and progressively add more sophisticated controls (seccomp, AppArmor, runtime security). The goal is to make exploitation difficult even if an attacker compromises a container.

Security is not a checkbox—it’s a continuous process of hardening, monitoring, and responding to threats.