Kubernetes has quickly become the de facto standard for container orchestration, but there’s a massive gap between running a demo app and operating production workloads at scale. Over the past year, I’ve been running critical services on Kubernetes, and I’ve learned that production readiness requires patterns that go far beyond kubectl apply.
Today I want to share the advanced patterns that have made the difference between a fragile container platform and a reliable production system.
Health Checks That Actually Work
Everyone knows about liveness and readiness probes, but I see them misused constantly. Here’s what I’ve learned:
Liveness probes answer: “Is this container hopelessly broken?” They should only fail when the container needs to be killed and restarted. I’ve seen teams use liveness probes for dependency checks—bad idea. If your database is down, killing all your app pods won’t help.
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
Readiness probes answer: “Is this container ready to serve traffic?” These should check dependencies. If your service can’t reach the database, it shouldn’t receive requests.
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
Here’s how I implement the actual health endpoints:
package main

import (
	"log"
	"net/http"
)

func livenessHandler(w http.ResponseWriter, r *http.Request) {
	// Only check whether the service itself is working.
	// Don't check dependencies: restarting this pod won't fix a down database.
	if !isServiceFunctional() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func readinessHandler(w http.ResponseWriter, r *http.Request) {
	// Check whether we can serve traffic right now.
	// This is where dependency checks belong.
	if !canHandleRequests() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Paths and port match the probe configuration above.
	http.HandleFunc("/health/live", livenessHandler)
	http.HandleFunc("/health/ready", readinessHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Resource Management: Requests vs Limits
Getting resource management right is critical. I’ve seen production outages from both under-provisioning and over-provisioning.
Resource requests tell the scheduler how much CPU and memory your pod needs; the pod is only placed on a node that can reserve that amount for it.
Resource limits cap how much the pod may use. Exceed the CPU limit and the container is throttled; exceed the memory limit and it is killed (OOMKilled).
Here’s my approach:
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
The pattern I follow:
- Set requests based on typical usage (90th percentile)
- Set memory limits at 2x requests (headroom for spikes)
- Be careful with CPU limits—they can cause unexpected throttling
I monitor actual usage and adjust over time. There’s no perfect formula; it’s empirical.
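If you want sensible defaults applied when a manifest forgets to set requests and limits, a LimitRange on the namespace is one option. This is only a sketch, and the numbers are illustrative rather than a recommendation:
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    default:
      cpu: 500m
      memory: 512Mi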
Pod Disruption Budgets
This is one of the most underutilized features I see. Pod Disruption Budgets (PDBs) ensure that cluster maintenance doesn’t take down your service.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: my-service-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: my-service
This ensures that during voluntary disruptions (node drains, cluster upgrades), at least 2 pods remain available. Without a PDB, a drain can evict every replica of your service at the same time.
StatefulSets for Stateful Workloads
Not everything fits the stateless model. For databases, queues, and other stateful services, I use StatefulSets instead of Deployments.
Key differences:
- Pods get stable network identities (my-service-0, my-service-1)
- Stable persistent storage
- Ordered, graceful deployment and scaling
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: database
spec:
serviceName: "database"
replicas: 3
selector:
matchLabels:
app: database
template:
metadata:
labels:
app: database
spec:
containers:
- name: db
image: postgres:9.6
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 10Gi
Each pod gets its own PersistentVolumeClaim that persists across pod restarts. This is essential for running databases in Kubernetes.
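One detail the manifest above depends on: the serviceName field points at a headless Service that has to exist for the stable per-pod DNS names to resolve. A minimal sketch, assuming the default Postgres port:
apiVersion: v1
kind: Service
metadata:
  name: database
spec:
  clusterIP: None
  selector:
    app: database
  ports:
  - port: 5432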
Init Containers for Setup Tasks
I use init containers for setup tasks that must complete before the main application starts:
spec:
initContainers:
- name: migration
image: my-app:latest
command: ['./run-migrations.sh']
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
containers:
- name: app
image: my-app:latest
This ensures database migrations run before the application starts. Init containers run sequentially and must complete successfully.
ConfigMaps and Secrets Management
Externalizing configuration is crucial for portability. I use ConfigMaps for non-sensitive data and Secrets for sensitive data.
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
log_level: "info"
feature_flags: "new-ui:true,beta:false"
---
apiVersion: v1
kind: Secret
metadata:
name: app-secrets
type: Opaque
data:
api_key: <base64-encoded-value>
Then mount them in pods:
containers:
- name: app
image: my-app:latest
envFrom:
- configMapRef:
name: app-config
- secretRef:
name: app-secrets
Important: Kubernetes Secrets are base64-encoded, not encrypted. For true encryption, use external secret management systems and inject secrets at runtime.
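On the consumption side, secrets can also be mounted as files instead of environment variables, which keeps them out of the process environment. A sketch reusing the app-secrets object above:
containers:
- name: app
  image: my-app:latest
  volumeMounts:
  - name: secrets
    mountPath: /etc/secrets
    readOnly: true
volumes:
- name: secrets
  secret:
    secretName: app-secrets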
Rolling Updates and Rollback Strategies
Deployments support rolling updates out of the box, but the defaults aren’t always optimal.
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
This means:
- At most 1 pod can be unavailable during updates
- At most 1 extra pod can be created during updates
For critical services, I often use maxUnavailable: 0 to ensure no capacity loss during deployments.
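In manifest form that variant looks like the following; new pods must become ready before old ones are removed, at the cost of briefly running one extra pod:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1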
Rollbacks are simple:
kubectl rollout undo deployment/my-service
But I prefer declarative rollbacks—revert the manifest in git and reapply.
Network Policies for Microsegmentation
By default, all pods can talk to all pods. For security, I implement network policies to restrict traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-policy
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
This enforces that only the frontend can call the API, and the API can only call the database. It’s zero-trust at the network level.
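A common companion is a default-deny baseline, so pods that no policy selects aren't left wide open. A minimal sketch (note that denying egress also blocks DNS unless you explicitly allow it):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress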
Horizontal Pod Autoscaling
For variable workloads, I use HPA to automatically scale based on CPU or custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
When average CPU utilization across the pods climbs above 70% of the requested CPU, Kubernetes adds pods; when it drops, it scales back down.
For more sophisticated scaling, I use custom metrics (queue depth, request latency) exposed through the custom metrics API.
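As a sketch of what that looks like in the metrics list of the HPA above, assuming a custom metrics adapter such as prometheus-adapter is installed and exposes a per-pod queue_depth metric (the metric name and target value are illustrative):
- type: Pods
  pods:
    metric:
      name: queue_depth
    target:
      type: AverageValue
      averageValue: "100"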
Observability: Logs, Metrics, Traces
Production Kubernetes requires comprehensive observability:
Logs: I ship all container stdout/stderr to a centralized logging system. Nobody should have to exec into a pod to read logs.
Metrics: I expose Prometheus metrics from every service and use node exporters for system metrics.
Traces: I instrument requests with distributed tracing to understand cross-service latencies.
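For the metrics piece, the wiring depends on how Prometheus is deployed. Assuming a Prometheus whose Kubernetes service discovery honors the common prometheus.io/* annotations, exposing a pod for scraping looks roughly like this in the Deployment's pod template:
template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8080"
      prometheus.io/path: "/metrics"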
Namespace Organization
I organize workloads into namespaces by environment and team:
apiVersion: v1
kind: Namespace
metadata:
name: production-api
labels:
environment: production
team: backend
This provides:
- Resource isolation
- Access control (RBAC per namespace)
- Resource quotas per team (see the sketch below)
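For the quota point, a ResourceQuota caps what a namespace can consume in total. A sketch with illustrative numbers:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: backend-quota
  namespace: production-api
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"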
Production Checklist
Before I deploy to production, I verify:
- Liveness and readiness probes configured
- Resource requests and limits set
- Multiple replicas for high availability
- Pod disruption budget defined
- Network policies enforced
- Secrets not hardcoded in manifests
- Logging and monitoring configured
- Deployment strategy tested
- Rollback procedure documented
Lessons Learned
Running Kubernetes in production has taught me that the platform is powerful but complex. The advanced patterns I’ve shared aren’t optional—they’re necessary for reliability.
The biggest shift in mindset is embracing declarative configuration and automation. Don’t kubectl exec into pods. Don’t manually edit resources. Treat your manifests as code, version them, review them, and test them.
Kubernetes gives you the primitives for building resilient systems. It’s up to you to use them correctly.