Distributed systems fail in unexpected ways. Rather than waiting for failures to happen in production at the worst possible time, chaos engineering proactively introduces controlled failures to discover weaknesses. After implementing chaos engineering practices across multiple production systems, I’ve learned how to build more resilient architectures and develop confidence in system behavior under stress.
The Principles of Chaos Engineering
Chaos engineering is based on running experiments that expose systemic weaknesses. The core principles are:
- Build a hypothesis around steady-state behavior: Define what “normal” looks like
- Vary real-world events: Introduce failures that could realistically occur
- Run experiments in production: Staging doesn’t capture production complexity
- Automate experiments: Make chaos continuous, not one-time events
- Minimize blast radius: Start small and expand gradually
Why Chaos Engineering Matters
Traditional testing approaches verify that systems work under expected conditions. Chaos engineering verifies they work under unexpected conditions:
- What happens when a database becomes unavailable?
- How does the system handle network latency?
- Can the application recover from a cascading failure?
- What’s the impact of a sudden traffic spike?
These scenarios are difficult to test conventionally but are inevitable in production.
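You can rehearse some of these conditions in ordinary tests before going anywhere near real infrastructure. The sketch below is a hypothetical helper rather than part of any chaos tool: an http.RoundTripper wrapper that injects latency and random failures into whatever HTTP client the code under test uses.
import (
    "fmt"
    "math/rand"
    "net/http"
    "time"
)

// flakyTransport wraps a RoundTripper and injects delay and failures.
type flakyTransport struct {
    next        http.RoundTripper
    latency     time.Duration // added to every request
    failureRate float64       // fraction of requests that fail outright
}

func (t *flakyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    time.Sleep(t.latency)
    if rand.Float64() < t.failureRate {
        return nil, fmt.Errorf("injected fault: %s %s", req.Method, req.URL)
    }
    return t.next.RoundTrip(req)
}

// newFlakyClient returns a client whose every call sees 500ms of extra
// latency and a 20% failure rate (both numbers are arbitrary test settings).
func newFlakyClient() *http.Client {
    return &http.Client{
        Transport: &flakyTransport{
            next:        http.DefaultTransport,
            latency:     500 * time.Millisecond,
            failureRate: 0.2,
        },
    }
}
Tests that exercise retries, timeouts, and fallbacks against this client catch the easy gaps, but they still don't capture production complexity, which is why the practices below matter.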
Starting with Chaos: The Game Day
Before automating chaos, run manual game days:
# Game Day Runbook: Database Failover Test
## Objective
Verify application gracefully handles primary database failure
## Prerequisites
- Monitoring dashboards open
- Incident response team on standby
- Customer support notified
- Rollback plan documented
## Experiment Steps
1. Monitor baseline metrics (5 minutes)
2. Terminate primary database instance
3. Observe application behavior
4. Wait for automatic failover
5. Verify application recovery
6. Monitor for 15 minutes post-recovery
## Success Criteria
- Zero user-visible errors
- Automatic failover completes within 30 seconds
- All transactions preserved
- Monitoring alerts fire appropriately
## Rollback
- Promote standby to primary
- Restart application pods if needed
Run this with your team, document observations, and fix issues before automating.
Chaos Toolkit: Framework-Agnostic Experiments
Chaos Toolkit provides a declarative way to define experiments:
# experiment.yaml
version: 1.0.0
title: Pod failure doesn't impact service availability
description: Terminate random pod and verify service remains healthy
steady-state-hypothesis:
  title: Service responds successfully
  probes:
    - type: probe
      name: app-responds
      tolerance: 200
      provider:
        type: http
        url: https://myapp.example.com/health
        timeout: 5
method:
  - type: action
    name: terminate-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=myapp
        ns: production
        qty: 1
        rand: true
  - type: probe
    name: app-still-responds
    tolerance: 200
    provider:
      type: http
      url: https://myapp.example.com/health
      timeout: 5
rollbacks:
  - type: action
    name: scale-deployment
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: scale_deployment
      arguments:
        name: myapp
        replicas: 3
        ns: production
Run the experiment:
chaos run experiment.yaml
Litmus Chaos for Kubernetes
Litmus provides Kubernetes-native chaos experiments:
Install Litmus:
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v2.0.0.yaml
Define a chaos experiment:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
Create RBAC:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: litmus-admin
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "deletecollection"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: litmus-admin
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: production
Common Chaos Experiments
Pod Failure
Test application resilience to pod crashes:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["delete", "get", "list"]
    image: "litmuschaos/go-runner:latest"
    args:
      - -c
      - ./experiments -name pod-delete
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
      - name: FORCE
        value: "true"
      - name: CHAOS_INTERVAL
        value: "10"
Network Latency
Introduce network delays:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: "2000" # 2 seconds
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: TARGET_CONTAINER
              value: "myapp"
            - name: NETWORK_INTERFACE
              value: "eth0"
Resource Exhaustion
Test behavior under CPU or memory pressure:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION
              value: "500" # MB
            - name: TOTAL_CHAOS_DURATION
              value: "60"
DNS Errors
Simulate DNS resolution failures:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-chaos
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TARGET_HOSTNAMES
              value: "external-api.example.com"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
Building Resilient Applications
Chaos experiments reveal weaknesses. Here’s how to address them:
Implement Retries with Exponential Backoff
func callExternalAPI(ctx context.Context, url string) (*http.Response, error) {
    maxRetries := 3
    baseDelay := 100 * time.Millisecond

    for attempt := 0; attempt < maxRetries; attempt++ {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := http.DefaultClient.Do(req)
        if err == nil && resp.StatusCode < 500 {
            return resp, nil
        }
        if resp != nil {
            resp.Body.Close() // discard the failed response before retrying
        }
        if attempt < maxRetries-1 {
            // Exponential backoff: 100ms, then 200ms, doubling each attempt
            delay := baseDelay * time.Duration(1<<uint(attempt))
            select {
            case <-time.After(delay):
                continue
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
    }
    return nil, fmt.Errorf("max retries exceeded")
}
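A common refinement is to add jitter so that many clients recovering from the same incident don't retry in lockstep. A minimal sketch, with a hypothetical helper name and assuming math/rand's default source is acceptable here:
import (
    "math/rand"
    "time"
)

// backoffWithJitter returns the delay for a given attempt: exponential growth
// plus a random component so synchronized clients spread their retries out.
func backoffWithJitter(baseDelay time.Duration, attempt int) time.Duration {
    exp := baseDelay * time.Duration(1<<uint(attempt))
    jitter := time.Duration(rand.Int63n(int64(baseDelay)))
    return exp + jitter
}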
Add Circuit Breakers
import "github.com/sony/gobreaker"
var cb *gobreaker.CircuitBreaker
func init() {
settings := gobreaker.Settings{
Name: "ExternalAPI",
MaxRequests: 3,
Interval: time.Minute,
Timeout: 30 * time.Second,
ReadyToTrip: func(counts gobreaker.Counts) bool {
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= 3 && failureRatio >= 0.6
},
}
cb = gobreaker.NewCircuitBreaker(settings)
}
func callWithCircuitBreaker(url string) (*Response, error) {
result, err := cb.Execute(func() (interface{}, error) {
return http.Get(url)
})
if err != nil {
return nil, err
}
return result.(*Response), nil
}
Implement Timeouts
// callWithTimeout bounds the whole call, including reading the response body,
// with a five-second deadline. Returning *http.Response here would be unsafe:
// the deferred cancel would cut off the caller's body read.
func callWithTimeout(ctx context.Context, url string) ([]byte, error) {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}
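If a call site doesn't already carry a context, a client-wide timeout is a simpler guardrail, and it also bounds reading the response body. A minimal sketch; the 10-second ceiling is an assumed figure to tune against the dependency's real latency profile:
import (
    "net/http"
    "time"
)

// One shared client for all outbound calls to this dependency.
var apiClient = &http.Client{Timeout: 10 * time.Second}

func fetchStatus(url string) (*http.Response, error) {
    // Every request through apiClient inherits the same overall deadline.
    return apiClient.Get(url)
}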
Graceful Degradation
type Service struct {
    cache Cache
    db    Database
}

func (s *Service) GetUser(id string) (*User, error) {
    // Try cache first
    if user, err := s.cache.Get(id); err == nil {
        return user, nil
    }

    // Fallback to database
    user, err := s.db.GetUser(id)
    if err != nil {
        // Return degraded response instead of surfacing the error
        return &User{
            ID:   id,
            Name: "User data temporarily unavailable",
        }, nil
    }

    // Update cache
    s.cache.Set(id, user)
    return user, nil
}
Observability for Chaos
Effective chaos engineering requires comprehensive observability:
import (
    "fmt"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint", "status"},
    )
    errorCounter = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_errors_total",
            Help: "Total HTTP errors",
        },
        []string{"method", "endpoint", "type"},
    )
)
func instrumentedHandler(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}

        next.ServeHTTP(wrapped, r)

        duration := time.Since(start).Seconds()
        requestDuration.WithLabelValues(
            r.Method,
            r.URL.Path,
            fmt.Sprintf("%d", wrapped.statusCode),
        ).Observe(duration)

        if wrapped.statusCode >= 500 {
            errorCounter.WithLabelValues(
                r.Method,
                r.URL.Path,
                "server_error",
            ).Inc()
        }
    })
}
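The handler wraps the ResponseWriter in a small responseWriter type to record the status code; that type isn't shown above, so here is a minimal version:
// responseWriter records the status code written by downstream handlers.
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}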
Automated Continuous Chaos
Schedule regular chaos experiments:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-experiment
  namespace: litmus
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: litmus-admin
          containers:
            - name: chaos
              image: litmuschaos/litmus-checker:latest
              command: ["/bin/bash"]
              args:
                - -c
                - |
                  kubectl apply -f /experiments/pod-delete.yaml
                  sleep 300
                  kubectl delete chaosengine pod-delete-chaos
          restartPolicy: OnFailure
Progressive Blast Radius
Start small and expand:
# Week 1: Single non-production pod
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TARGET_PODS
            value: "1"

# Week 2: Multiple non-production pods
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TARGET_PODS
            value: "3"

# Week 3: Production with low traffic
# Week 4: Full production deployment
Safety Controls
Implement guardrails:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: safe-chaos
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  # Bound cleanup time for the chaos resources
  terminationGracePeriodSeconds: 30
  experiments:
    - name: pod-delete
      spec:
        # Continuous probe: the experiment fails if the health check stops returning 200
        probe:
          - name: check-service-health
            type: httpProbe
            httpProbe/inputs:
              url: https://myapp.example.com/health
              expectedResponseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
Metrics and Success Criteria
Define what success looks like:
metrics:
  - name: availability
    threshold: ">= 99.9%"
    query: |
      sum(rate(http_requests_total{status!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
  - name: latency_p99
    threshold: "<= 500ms"
    query: |
      histogram_quantile(0.99,
        rate(http_request_duration_seconds_bucket[5m]))
  - name: error_rate
    threshold: "<= 0.1%"
    query: |
      sum(rate(http_errors_total[5m])) /
      sum(rate(http_requests_total[5m]))
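Something has to run these queries during an experiment and compare the results against the thresholds. The sketch below goes straight at the Prometheus HTTP API; the promURL parameter and the 0.999 availability target are assumptions carried over from the table above, and in practice you would likely reuse existing SLO or alerting tooling instead:
import (
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "strconv"
)

// queryPrometheus runs an instant query and returns the first sample's value.
func queryPrometheus(ctx context.Context, promURL, query string) (float64, error) {
    endpoint := promURL + "/api/v1/query?query=" + url.QueryEscape(query)
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
    if err != nil {
        return 0, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var body struct {
        Data struct {
            Result []struct {
                Value [2]interface{} `json:"value"` // [timestamp, "value"]
            } `json:"result"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        return 0, err
    }
    if len(body.Data.Result) == 0 {
        return 0, fmt.Errorf("no samples for query %q", query)
    }
    s, ok := body.Data.Result[0].Value[1].(string)
    if !ok {
        return 0, fmt.Errorf("unexpected sample format for query %q", query)
    }
    return strconv.ParseFloat(s, 64)
}

// checkAvailability fails the experiment run if availability drops below 99.9%.
func checkAvailability(ctx context.Context, promURL string) error {
    const query = `sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
    v, err := queryPrometheus(ctx, promURL, query)
    if err != nil {
        return err
    }
    if v < 0.999 {
        return fmt.Errorf("availability %.4f is below the steady-state threshold of 0.999", v)
    }
    return nil
}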
Building a Chaos Engineering Culture
Technical implementation is only part of the challenge:
- Get buy-in: Demonstrate value with small, safe experiments
- Start outside production: Build confidence in staging first
- Run game days: Make chaos engineering a team activity
- Document everything: Maintain runbooks and post-mortems
- Celebrate learnings: Focus on improvements, not blame
- Make it continuous: Automate experiments over time
Common Pitfalls
Avoid these mistakes:
- Too much too soon: Start with gentle experiments
- No hypothesis: Define expected behavior first
- Poor observability: You can’t learn if you can’t measure
- Ignoring findings: Act on discovered weaknesses
- Running once: Make chaos continuous
- Production YOLO: Build confidence in lower environments first
Conclusion
Chaos engineering transforms how we think about system reliability. Instead of hoping systems will handle failures gracefully, we verify they actually do through controlled experiments.
Start small:
- Define steady-state metrics
- Run manual game days
- Automate simple experiments (pod deletion)
- Gradually expand scope and complexity
- Make chaos continuous
The goal isn’t to cause outages—it’s to discover weaknesses before they cause outages. Every chaos experiment is an opportunity to build a more resilient system and develop confidence in its behavior under stress.
Remember: You don’t find out if your parachute works by jumping out of a plane. You test it on the ground first. Chaos engineering is testing your parachutes.