Running containers in production is very different from running them on your laptop. I’ve spent the last year operating containerized workloads at scale, and the learning curve has been steep. Container orchestration platforms like Kubernetes provide powerful primitives, but using them effectively requires understanding both the platform and distributed systems principles.

Today, I want to share the hard-won lessons I’ve learned about running production container workloads—the practices that make the difference between a stable system and one that keeps you up at night.

Resource Management

The most common mistake I see: not setting resource requests and limits properly.

Resource Requests and Limits

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: order-service:v1.2.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"

Requests: What the container is guaranteed. The scheduler uses requests to decide where pods fit.

Limits: The maximum the container can use. Exceeding the memory limit gets the container OOM-killed; exceeding the CPU limit throttles it.

How to set them:

  1. Run load tests and monitor resource usage
  2. Set requests to typical usage
  3. Set limits to peak usage with some headroom
  4. Monitor and adjust based on actual usage

// Example: Monitor actual usage and recommend resource settings.
// MetricsClient, ResourceRecommendation, and percentile are small
// helpers defined here so the example is self-contained; QueryRange
// is assumed to return the raw sample values from Prometheus.
type MetricsClient interface {
    QueryRange(query string, start, end time.Time) ([]float64, error)
}

type ResourceRecommendation struct {
    CPURequest    string
    CPULimit      string
    MemoryRequest string
    MemoryLimit   string
}

type ResourceMonitor struct {
    metricsClient MetricsClient
}

func (m *ResourceMonitor) AnalyzeUsage(namespace, deployment string, days int) (*ResourceRecommendation, error) {
    start, end := time.Now().Add(-time.Duration(days)*24*time.Hour), time.Now()

    // Query actual CPU usage (cores) over time
    cpuUsage, err := m.metricsClient.QueryRange(fmt.Sprintf(
        `avg(rate(container_cpu_usage_seconds_total{namespace="%s",pod=~"%s-.*"}[5m]))`,
        namespace, deployment,
    ), start, end)
    if err != nil {
        return nil, err
    }

    // Query actual memory usage (bytes) over time
    memUsage, err := m.metricsClient.QueryRange(fmt.Sprintf(
        `avg(container_memory_working_set_bytes{namespace="%s",pod=~"%s-.*"})`,
        namespace, deployment,
    ), start, end)
    if err != nil {
        return nil, err
    }

    // Requests from typical (p50) usage, limits from peak (p95) plus headroom
    cpuP50 := percentile(cpuUsage, 0.50)
    cpuP95 := percentile(cpuUsage, 0.95)
    memP50 := percentile(memUsage, 0.50)
    memP95 := percentile(memUsage, 0.95)

    return &ResourceRecommendation{
        CPURequest:    fmt.Sprintf("%.0fm", cpuP50*1000),
        CPULimit:      fmt.Sprintf("%.0fm", cpuP95*1.2*1000), // 20% headroom
        MemoryRequest: fmt.Sprintf("%.0fMi", memP50/(1024*1024)),
        MemoryLimit:   fmt.Sprintf("%.0fMi", memP95*1.2/(1024*1024)),
    }, nil
}

// percentile returns the p-th quantile (0 <= p <= 1) of the samples.
func percentile(samples []float64, p float64) float64 {
    if len(samples) == 0 {
        return 0
    }
    sorted := append([]float64(nil), samples...)
    sort.Float64s(sorted)
    return sorted[int(p*float64(len(sorted)-1))]
}

Quality of Service Classes

Kubernetes assigns QoS classes based on resource configuration:

Guaranteed: Requests == Limits for all containers. Highest priority.

Burstable: At least one container sets requests or limits, but the pod doesn't meet the Guaranteed criteria. Medium priority.

BestEffort: No requests or limits. Lowest priority, killed first under pressure.

For production workloads, use Guaranteed or Burstable. Never BestEffort.
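
For example, giving a container identical requests and limits yields the Guaranteed class:

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "500m"

You can confirm the assigned class with kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'.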

Health Checks

Kubernetes needs to know if your application is healthy.

Liveness Probes

Restart containers that are deadlocked or hung:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Implement a liveness endpoint that reports process-internal health only:

type HealthStatus struct {
    Status string            `json:"status"`
    Checks map[string]string `json:"checks"`
}

type HealthChecker struct {
    db    *sql.DB
    cache *redis.Client
}

// Liveness: if this handler runs at all, the HTTP serving loop is
// alive and the process is not deadlocked. Dependency checks
// (database, cache) belong in the readiness probe below.
func (h *HealthChecker) Check(w http.ResponseWriter, r *http.Request) {
    status := &HealthStatus{
        Status: "healthy",
        Checks: map[string]string{"process": "ok"},
    }

    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(status)
}

Important: Liveness probes should check internal health, not dependency health. If your database is down, you don’t want all your pods restarting—that makes things worse.

Readiness Probes

Remove pods from service when they can’t handle traffic:

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

Unlike liveness, readiness should check dependencies:

func (h *HealthChecker) Ready(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    // Check if we can reach critical dependencies
    if err := h.db.PingContext(ctx); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "not ready",
            "reason": "database unavailable",
        })
        return
    }

    if err := h.cache.Ping(ctx).Err(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "not ready",
            "reason": "cache unavailable",
        })
        return
    }

    // Check if local caches are warmed up
    if !h.isCacheWarmed() {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "not ready",
            "reason": "cache warming",
        })
        return
    }

    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{"status": "ready"})
}

Graceful Shutdown

Handle SIGTERM properly to avoid dropping requests during deployment:

func main() {
    server := &http.Server{
        Addr:    ":8080",
        Handler: router,
    }

    // Start server in goroutine
    go func() {
        if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("Server failed: %v", err)
        }
    }()

    // Wait for interrupt signal
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
    <-quit

    log.Println("Shutting down server...")

    // Give outstanding requests time to complete
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := server.Shutdown(ctx); err != nil {
        log.Printf("Server forced to shutdown: %v", err)
    }

    log.Println("Server exited")
}

Configure Kubernetes to wait during termination:

spec:
  containers:
  - name: order-service
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"]
  terminationGracePeriodSeconds: 60

The sleep gives time for:

  1. Load balancer to remove pod from rotation
  2. In-flight requests to complete
  3. Service to clean up resources
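
A refinement worth adding: fail your readiness probe as soon as SIGTERM arrives, so the endpoints controller stops routing new traffic while you drain. A minimal sketch (the shuttingDown flag is an addition for illustration, not part of the server above):

var shuttingDown atomic.Bool

// Readiness starts failing the moment shutdown begins, so Kubernetes
// removes the pod from service endpoints before requests are drained.
func readyHandler(w http.ResponseWriter, r *http.Request) {
    if shuttingDown.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// In main, after <-quit:
//   shuttingDown.Store(true)       // readiness now returns 503
//   time.Sleep(10 * time.Second)   // wait for endpoint removal to propagate
//   server.Shutdown(ctx)           // then drain in-flight requests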

Configuration Management

ConfigMaps for Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
data:
  app.properties: |
    database.host=postgres.default.svc.cluster.local
    database.port=5432
    cache.ttl=300
    feature.newCheckout=true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
      - name: order-service
        volumeMounts:
        - name: config
          mountPath: /etc/config
      volumes:
      - name: config
        configMap:
          name: order-service-config
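
On the application side, the mounted file is read like any other. A minimal sketch for the key=value format above (loadConfig and its path come from this example, not a standard API):

// Load key=value pairs from the ConfigMap volume mounted at /etc/config.
func loadConfig() (map[string]string, error) {
    data, err := os.ReadFile("/etc/config/app.properties")
    if err != nil {
        return nil, err
    }
    cfg := make(map[string]string)
    for _, line := range strings.Split(string(data), "\n") {
        line = strings.TrimSpace(line)
        if line == "" || strings.HasPrefix(line, "#") {
            continue
        }
        if key, value, ok := strings.Cut(line, "="); ok {
            cfg[strings.TrimSpace(key)] = strings.TrimSpace(value)
        }
    }
    return cfg, nil
}

Because mounted ConfigMaps are updated in place when the object changes (after a kubelet sync delay, and never for subPath mounts), re-reading the file lets you pick up changes without a restart.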

Secrets for Sensitive Data

apiVersion: v1
kind: Secret
metadata:
  name: order-service-secrets
type: Opaque
stringData:
  database-password: "super-secret-password"
  api-key: "abc123def456"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
      - name: order-service
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: order-service-secrets
              key: database-password
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: order-service-secrets
              key: api-key

Never commit secrets to Git. Use external secret management:

# External Secrets example
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: order-service-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: order-service-secrets
  data:
  - secretKey: database-password
    remoteRef:
      key: secret/order-service/database
      property: password

Deployment Strategies

Rolling Updates

Default strategy—replace pods gradually:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # How many extra pods during update
      maxUnavailable: 0  # How many pods can be down

With maxUnavailable: 0, old pods stay in service until their replacements pass readiness checks, which keeps deployments zero-downtime.
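
kubectl can watch the rollout and revert it if the new version misbehaves:

kubectl rollout status deployment/order-service
kubectl rollout undo deployment/order-service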

Canary Deployments

Deploy to a small subset first:

# Stable version - 90% of traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-stable
  labels:
    version: stable
spec:
  replicas: 9
  template:
    metadata:
      labels:
        app: order-service
        version: stable
---
# Canary version - 10% of traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-canary
  labels:
    version: canary
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: order-service
        version: canary
    spec:
      containers:
      - name: order-service
        image: order-service:v1.3.0-canary

Because a single Service selects app: order-service on both Deployments, traffic splits roughly by replica count (here 9:1). Monitor the canary's error rate and latency against stable; if they hold, scale the canary up and stable down until the new version takes all traffic.
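
A concrete check you might run, assuming your request counter carries the version label from the pod templates above (http_requests_total is an assumption about your instrumentation, not a built-in metric):

# 5xx ratio for the canary; compare against the same query with version="stable"
sum(rate(http_requests_total{app="order-service",version="canary",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{app="order-service",version="canary"}[5m]))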

Autoscaling

Horizontal Pod Autoscaler

Scale based on metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60

Key configurations:

  • stabilizationWindowSeconds: Wait before scaling to avoid flapping
  • scaleUp policies: How aggressively to scale up (fast is good for traffic spikes)
  • scaleDown policies: How aggressively to scale down (slow prevents premature scale-down)
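
Underneath, the HPA scales proportionally to how far the current metric is from its target:

desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)

With a 70% CPU target, for example, 3 pods averaging 105% utilization scale to ceil(3 * 105 / 70) = 5 replicas, subject to the behavior policies above.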

Custom Metrics

Scale based on application metrics:

metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "1000"

Implement a metrics endpoint for custom metrics. The HPA cannot scrape this directly; a metrics adapter such as prometheus-adapter must expose the series through the custom metrics API:

func (m *MetricsHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // Expose metrics in Prometheus format
    metrics := []string{
        fmt.Sprintf("http_requests_per_second %f", m.getRequestRate()),
        fmt.Sprintf("queue_depth %d", m.getQueueDepth()),
        fmt.Sprintf("active_connections %d", m.getActiveConnections()),
    }

    w.Header().Set("Content-Type", "text/plain")
    for _, metric := range metrics {
        fmt.Fprintln(w, metric)
    }
}

Pod Disruption Budgets

Ensure availability during maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: order-service

This prevents Kubernetes from evicting too many pods during node maintenance or upgrades.
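
With this budget in place, a drain can only evict pods while at least two stay available; evictions that would violate the budget are refused and retried (node-1 is a placeholder):

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data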

Resource Quotas and Limit Ranges

Prevent resource exhaustion:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
  - max:
      cpu: "4"
      memory: 8Gi
    min:
      cpu: "100m"
      memory: 128Mi
    default:
      cpu: "500m"
      memory: 512Mi
    defaultRequest:
      cpu: "250m"
      memory: 256Mi
    type: Container

Networking

Network Policies

Control traffic between pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: order-service-netpol
spec:
  podSelector:
    matchLabels:
      app: order-service
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  # Allow DNS; once egress is restricted, lookups must be explicitly
  # permitted (most distributions label their DNS pods k8s-app=kube-dns)
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53

This implements zero-trust networking at the pod level. Note the DNS rule: a default-deny egress policy silently breaks name resolution without it.

Service Mesh Integration

For advanced traffic management, observability, and security, integrate a service mesh. Sidecar injection is enabled with a mesh-specific annotation or label; for example, Linkerd uses an annotation on the pod template, while Istio typically uses the istio-injection=enabled namespace label:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
The service mesh provides:

  • Automatic mTLS between services
  • Traffic splitting for canaries
  • Advanced routing rules
  • Distributed tracing

Logging and Monitoring

Structured Logging

var logger = log.New(os.Stdout, "", 0)

func logStructured(level, message string, fields map[string]interface{}) {
    entry := map[string]interface{}{
        "timestamp": time.Now().UTC().Format(time.RFC3339),
        "level":     level,
        "message":   message,
        "pod":       os.Getenv("HOSTNAME"),
        "namespace": os.Getenv("POD_NAMESPACE"),
    }

    for k, v := range fields {
        entry[k] = v
    }

    data, _ := json.Marshal(entry)
    logger.Println(string(data))
}
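
Each call emits one JSON object per line on stdout, which the collector below can parse. For example (field names and output values are illustrative):

logStructured("info", "order created", map[string]interface{}{
    "order_id":    "ord-1042",
    "duration_ms": 38,
})
// stdout: {"timestamp":"2025-06-01T10:30:00Z","level":"info","message":"order created","order_id":"ord-1042","duration_ms":38,"pod":"order-service-7d9f...","namespace":"production"}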

Centralized Logging

Ship logs to a central system:

# Fluentd DaemonSet for log collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluentd:v1.14
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: containers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: containers
        hostPath:
          path: /var/lib/docker/containers

Note that /var/lib/docker/containers is specific to the Docker runtime; on containerd-based clusters, container logs live under /var/log/pods and /var/log/containers, so adjust the mounts to match your runtime.

Best Practices Summary

Resource management: Always set requests and limits. Use monitoring to tune them.

Health checks: Implement proper liveness and readiness probes. Test them.

Graceful shutdown: Handle SIGTERM and drain connections before exiting.

Configuration: Use ConfigMaps and Secrets. Never hardcode configuration.

Deployments: Use rolling updates. Implement canary deployments for risky changes.

Autoscaling: Configure HPA for variable load. Set conservative scale-down policies.

Resilience: Use Pod Disruption Budgets to maintain availability.

Security: Implement Network Policies. Use service mesh for zero-trust networking.

Observability: Structured logging, metrics, and distributed tracing are essential.

Looking Forward

Container orchestration is maturing rapidly. Kubernetes has won the orchestration wars, and the ecosystem is building powerful abstractions on top of it. Service meshes are becoming standard. GitOps is automating deployments. Serverless containers are emerging.

But the fundamentals remain: understand resource management, implement proper health checks, handle failures gracefully, and maintain comprehensive observability.

The practices I’ve shared come from real production experience—late-night incidents, post-mortems, and gradual improvements. Start with these patterns, adapt them to your needs, and keep learning from your incidents.

Your production workloads will thank you.