Kubernetes has quickly become the de facto standard for container orchestration, but there’s a massive gap between running a demo app and operating production workloads at scale. Over the past year, I’ve been running critical services on Kubernetes, and I’ve learned that production readiness requires patterns that go far beyond kubectl apply.

Today I want to share the advanced patterns that have made the difference between a fragile container platform and a reliable production system.

Health Checks That Actually Work

Everyone knows about liveness and readiness probes, but I see them misused constantly. Here’s what I’ve learned:

Liveness probes answer: “Is this container hopelessly broken?” They should only fail when the container needs to be killed and restarted. I’ve seen teams use liveness probes for dependency checks—bad idea. If your database is down, killing all your app pods won’t help.

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Readiness probes answer: “Is this container ready to serve traffic?” These should check dependencies. If your service can’t reach the database, it shouldn’t receive requests.

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

Here’s how I implement the actual health endpoints:

package main

import "net/http"

// isServiceFunctional is a stand-in for an application-specific self-check.
func isServiceFunctional() bool { return true }

// canHandleRequests is a stand-in for a readiness check that includes dependencies.
func canHandleRequests() bool { return true }

func livenessHandler(w http.ResponseWriter, r *http.Request) {
    // Only check if the service itself is working
    // Don't check dependencies
    if !isServiceFunctional() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func readinessHandler(w http.ResponseWriter, r *http.Request) {
    // Check if we can serve traffic
    // This includes dependency checks
    if !canHandleRequests() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    // Paths match the probe configuration above
    http.HandleFunc("/health/live", livenessHandler)
    http.HandleFunc("/health/ready", readinessHandler)
    http.ListenAndServe(":8080", nil)
}

Resource Management: Requests vs Limits

Getting resource management right is critical. I’ve seen production outages from both under-provisioning and over-provisioning.

Resource requests tell the scheduler how much CPU/memory your pod needs. The scheduler uses requests for placement, and under contention the pod is guaranteed at least that much.

Resource limits define the maximum your pod can use. For CPU, you’re throttled. For memory, you’re killed (OOMKilled).

Here’s my approach:

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

The pattern I follow:

  • Set requests based on typical usage (90th percentile)
  • Set memory limits at 2x requests (headroom for spikes)
  • Be careful with CPU limits—they can cause unexpected throttling

I monitor actual usage and adjust over time. There’s no perfect formula; it’s empirical.
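
As a quick check, assuming metrics-server is installed in the cluster, kubectl can show what pods are actually consuming (the app=my-service selector is a placeholder):

kubectl top pod -l app=my-service --containers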

Pod Disruption Budgets

This is one of the most underutilized features I see. Pod Disruption Budgets (PDBs) ensure that cluster maintenance doesn’t take down your service.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-service

This ensures that during voluntary disruptions (node drains, cluster upgrades), at least 2 pods remain available. Without a PDB, a node drain or rolling upgrade can evict every replica of your service at once.
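
The spec also accepts maxUnavailable instead of minAvailable (one or the other, not both), which can be easier to reason about when replica counts change. A minimal variant of the same budget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-service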

StatefulSets for Stateful Workloads

Not everything fits the stateless model. For databases, queues, and other stateful services, I use StatefulSets instead of Deployments.

Key differences:

  • Pods get stable network identities (my-service-0, my-service-1)
  • Stable persistent storage
  • Ordered, graceful deployment and scaling

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
spec:
  serviceName: "database"
  replicas: 3
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - name: db
        image: postgres:9.6
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

Each pod gets its own PersistentVolumeClaim that persists across pod restarts. This is essential for running databases in Kubernetes.
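
One detail the manifest above assumes: serviceName must reference a headless Service, which is what gives each pod its stable DNS name (database-0.database, and so on). A minimal sketch of that Service:

apiVersion: v1
kind: Service
metadata:
  name: database
spec:
  clusterIP: None
  selector:
    app: database
  ports:
  - port: 5432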

Init Containers for Setup Tasks

I use init containers for setup tasks that must complete before the main application starts:

spec:
  initContainers:
  - name: migration
    image: my-app:latest
    command: ['./run-migrations.sh']
    env:
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: url
  containers:
  - name: app
    image: my-app:latest

This ensures database migrations run before the application starts. Init containers run sequentially and must complete successfully.
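
Another common use is blocking startup until a dependency is resolvable. A hypothetical sketch (the database Service name is a placeholder):

initContainers:
- name: wait-for-db
  image: busybox:1.36
  command: ['sh', '-c', 'until nslookup database; do echo waiting for database; sleep 2; done']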

ConfigMaps and Secrets Management

Externalizing configuration is crucial for portability. I use ConfigMaps for non-sensitive data and Secrets for sensitive data.

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  log_level: "info"
  feature_flags: "new-ui:true,beta:false"
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
data:
  api_key: <base64-encoded-value>

Then mount them in pods:

containers:
- name: app
  image: my-app:latest
  envFrom:
  - configMapRef:
      name: app-config
  - secretRef:
      name: app-secrets

Important: Kubernetes Secrets are only base64-encoded, not encrypted by default. For true encryption, enable encryption at rest or use an external secret management system and inject secrets at runtime.
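
To keep even the encoded values out of git, the Secret can be created out of band and only referenced by name in the manifests (the key and value here are placeholders):

kubectl create secret generic app-secrets --from-literal=api_key=REPLACE_ME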

Rolling Updates and Rollback Strategies

Deployments support rolling updates out of the box, but the defaults aren’t always optimal.

spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

This means:

  • At most 1 pod can be unavailable during updates
  • At most 1 extra pod can be created during updates

For critical services, I often use maxUnavailable: 0 to ensure no capacity loss during deployments.
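
In manifest form, that looks like this:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1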

Rollbacks are simple:

kubectl rollout undo deployment/my-service

But I prefer declarative rollbacks—revert the manifest in git and reapply.
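
Either way, it's worth checking which revision you're rolling back to and watching the rollout converge:

kubectl rollout history deployment/my-service
kubectl rollout status deployment/my-service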

Network Policies for Microsegmentation

By default, all pods can talk to all pods. For security, I implement network policies to restrict traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432

This enforces that only the frontend can call the API, and the API can only call the database. It’s zero-trust at the network level.
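
NetworkPolicies are additive allow rules, so they're typically paired with a namespace-wide default deny (this only takes effect if the cluster's CNI plugin enforces NetworkPolicy). A minimal sketch:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress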

Horizontal Pod Autoscaling

For variable workloads, I use HPA to automatically scale based on CPU or custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

When average CPU utilization (relative to the pods' requests) exceeds 70%, Kubernetes adds pods; when it drops back below the target, it scales down after a stabilization window.

For more sophisticated scaling, I use custom metrics (queue depth, request latency) via the metrics API.
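
A sketch of what a custom metric looks like in the same HPA, assuming a metrics adapter (for example, the Prometheus adapter) exposes a per-pod queue_depth metric; the metric name and target value are placeholders:

metrics:
- type: Pods
  pods:
    metric:
      name: queue_depth
    target:
      type: AverageValue
      averageValue: "30"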

Observability: Logs, Metrics, Traces

Production Kubernetes requires comprehensive observability:

Logs: I ship all container stdout/stderr to a centralized logging system. Never shell into pods to read logs.

Metrics: I expose Prometheus metrics from every service and use node exporters for system metrics.

Traces: I instrument requests with distributed tracing to understand cross-service latencies.
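
For metrics scraping, a common convention is to annotate the pod template so a Prometheus configured with the usual Kubernetes service-discovery relabeling picks services up automatically. These annotations are a convention, not built into Prometheus, so whether they work depends on your scrape configuration:

template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8080"
      prometheus.io/path: "/metrics"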

Namespace Organization

I organize workloads into namespaces by environment and team:

apiVersion: v1
kind: Namespace
metadata:
  name: production-api
  labels:
    environment: production
    team: backend

This provides:

  • Resource isolation
  • Access control (RBAC per namespace)
  • Resource quotas per team
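
For the quota point, a minimal ResourceQuota sketch (the numbers are placeholders):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: backend-quota
  namespace: production-api
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"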

Production Checklist

Before I deploy to production, I verify:

  • Liveness and readiness probes configured
  • Resource requests and limits set
  • Multiple replicas for high availability
  • Pod disruption budget defined
  • Network policies enforced
  • Secrets not hardcoded in manifests
  • Logging and monitoring configured
  • Deployment strategy tested
  • Rollback procedure documented

Lessons Learned

Running Kubernetes in production has taught me that the platform is powerful but complex. The advanced patterns I’ve shared aren’t optional—they’re necessary for reliability.

The biggest shift in mindset is embracing declarative configuration and automation. Don’t kubectl exec into pods. Don’t manually edit resources. Treat your manifests as code, version them, review them, and test them.

Kubernetes gives you the primitives for building resilient systems. It’s up to you to use them correctly.