Distributed systems fail in unexpected ways. Rather than waiting for failures to happen in production at the worst possible time, chaos engineering proactively introduces controlled failures to discover weaknesses. After implementing chaos engineering practices across multiple production systems, I’ve learned how to build more resilient architectures and develop confidence in system behavior under stress.
The Principles of Chaos Engineering
Chaos engineering is based on running experiments that expose systemic weaknesses. The core principles are:
- Build a hypothesis around steady-state behavior: Define what “normal” looks like
- Vary real-world events: Introduce failures that could realistically occur
- Run experiments in production: Staging doesn’t capture production complexity
- Automate experiments: Make chaos continuous, not one-time events
- Minimize blast radius: Start small and expand gradually
Why Chaos Engineering Matters
Traditional testing approaches verify that systems work under expected conditions. Chaos engineering verifies they work under unexpected conditions:
- What happens when a database becomes unavailable?
- How does the system handle network latency?
- Can the application recover from a cascading failure?
- What’s the impact of a sudden traffic spike?
These scenarios are difficult to test conventionally but are inevitable in production.
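You can rehearse some of these conditions in ordinary tests before going anywhere near real infrastructure. The sketch below is a hypothetical helper rather than part of any chaos tool: an http.RoundTripper wrapper that injects latency and random failures into whatever HTTP client the code under test uses.
import (
    "fmt"
    "math/rand"
    "net/http"
    "time"
)

// flakyTransport wraps a RoundTripper and injects delay and failures.
type flakyTransport struct {
    next        http.RoundTripper
    latency     time.Duration // added to every request
    failureRate float64       // fraction of requests that fail outright
}

func (t *flakyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    time.Sleep(t.latency)
    if rand.Float64() < t.failureRate {
        return nil, fmt.Errorf("injected fault: %s %s", req.Method, req.URL)
    }
    return t.next.RoundTrip(req)
}

// newFlakyClient returns a client whose every call sees 500ms of extra
// latency and a 20% failure rate (both numbers are arbitrary test settings).
func newFlakyClient() *http.Client {
    return &http.Client{
        Transport: &flakyTransport{
            next:        http.DefaultTransport,
            latency:     500 * time.Millisecond,
            failureRate: 0.2,
        },
    }
}
Tests that exercise retries, timeouts, and fallbacks against this client catch the easy gaps, but they still don't capture production complexity, which is why the practices below matter.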
Starting with Chaos: The Game Day
Before automating chaos, run manual game days:
# Game Day Runbook: Database Failover Test
## Objective
Verify application gracefully handles primary database failure
## Prerequisites
- Monitoring dashboards open
- Incident response team on standby
- Customer support notified
- Rollback plan documented
## Experiment Steps
1. Monitor baseline metrics (5 minutes)
2. Terminate primary database instance
3. Observe application behavior
4. Wait for automatic failover
5. Verify application recovery
6. Monitor for 15 minutes post-recovery
## Success Criteria
- Zero user-visible errors
- Automatic failover completes within 30 seconds
- All transactions preserved
- Monitoring alerts fire appropriately
## Rollback
- Promote standby to primary
- Restart application pods if needed
Run this with your team, document observations, and fix issues before automating.
Chaos Toolkit: Framework-Agnostic Experiments
Chaos Toolkit provides a declarative way to define experiments:
# experiment.yaml
version: 1.0.0
title: Pod failure doesn't impact service availability
description: Terminate random pod and verify service remains healthy
steady-state-hypothesis:
  title: Service responds successfully
  probes:
    - type: probe
      name: app-responds
      tolerance: 200
      provider:
        type: http
        url: https://myapp.example.com/health
        timeout: 5
method:
  - type: action
    name: terminate-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=myapp
        ns: production
        qty: 1
        rand: true
  - type: probe
    name: app-still-responds
    tolerance: 200
    provider:
      type: http
      url: https://myapp.example.com/health
      timeout: 5
rollbacks:
  - type: action
    name: scale-deployment
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: scale_deployment
      arguments:
        name: myapp
        replicas: 3
        ns: production
Run the experiment:
chaos run experiment.yaml
Litmus Chaos for Kubernetes
Litmus provides Kubernetes-native chaos experiments:
Install Litmus:
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v2.0.0.yaml
Define a chaos experiment:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
Create RBAC:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: litmus-admin
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "deletecollection"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: litmus-admin
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: production
Common Chaos Experiments
Pod Failure
Test application resilience to pod crashes:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["delete", "get", "list"]
    image: "litmuschaos/go-runner:latest"
    args:
      - -c
      - ./experiments -name pod-delete
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
      - name: FORCE
        value: "true"
      - name: CHAOS_INTERVAL
        value: "10"
Network Latency
Introduce network delays:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: "2000" # 2 seconds
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: TARGET_CONTAINER
              value: "myapp"
            - name: NETWORK_INTERFACE
              value: "eth0"
Resource Exhaustion
Test behavior under CPU or memory pressure:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION
              value: "500" # MB
            - name: TOTAL_CHAOS_DURATION
              value: "60"
DNS Errors
Simulate DNS resolution failures:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-chaos
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TARGET_HOSTNAMES
              value: "external-api.example.com"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
Building Resilient Applications
Chaos experiments reveal weaknesses. Here’s how to address them:
Implement Retries with Exponential Backoff
func callExternalAPI(ctx context.Context, url string) (*http.Response, error) {
    maxRetries := 3
    baseDelay := 100 * time.Millisecond

    for attempt := 0; attempt < maxRetries; attempt++ {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := http.DefaultClient.Do(req)
        if err == nil && resp.StatusCode < 500 {
            return resp, nil
        }
        if resp != nil {
            resp.Body.Close() // discard the failed response before retrying
        }
        if attempt < maxRetries-1 {
            // Exponential backoff: 100ms, then 200ms, doubling each attempt
            delay := baseDelay * time.Duration(1<<uint(attempt))
            select {
            case <-time.After(delay):
                continue
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
    }
    return nil, fmt.Errorf("max retries exceeded")
}
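A common refinement is to add jitter so that many clients recovering from the same incident don't retry in lockstep. A minimal sketch, with a hypothetical helper name and assuming math/rand's default source is acceptable here:
import (
    "math/rand"
    "time"
)

// backoffWithJitter returns the delay for a given attempt: exponential growth
// plus a random component so synchronized clients spread their retries out.
func backoffWithJitter(baseDelay time.Duration, attempt int) time.Duration {
    exp := baseDelay * time.Duration(1<<uint(attempt))
    jitter := time.Duration(rand.Int63n(int64(baseDelay)))
    return exp + jitter
}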
Add Circuit Breakers
import "github.com/sony/gobreaker"
var cb *gobreaker.CircuitBreaker
func init() {
settings := gobreaker.Settings{
Name: "ExternalAPI",
MaxRequests: 3,
Interval: time.Minute,
Timeout: 30 * time.Second,
ReadyToTrip: func(counts gobreaker.Counts) bool {
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= 3 && failureRatio >= 0.6
},
}
cb = gobreaker.NewCircuitBreaker(settings)
}
func callWithCircuitBreaker(url string) (*Response, error) {
result, err := cb.Execute(func() (interface{}, error) {
return http.Get(url)
})
if err != nil {
return nil, err
}
return result.(*Response), nil
}
Implement Timeouts
// callWithTimeout bounds the whole call, including reading the response body,
// with a five-second deadline. Returning *http.Response here would be unsafe:
// the deferred cancel would cut off the caller's body read.
func callWithTimeout(ctx context.Context, url string) ([]byte, error) {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}
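If a call site doesn't already carry a context, a client-wide timeout is a simpler guardrail, and it also bounds reading the response body. A minimal sketch; the 10-second ceiling is an assumed figure to tune against the dependency's real latency profile:
import (
    "net/http"
    "time"
)

// One shared client for all outbound calls to this dependency.
var apiClient = &http.Client{Timeout: 10 * time.Second}

func fetchStatus(url string) (*http.Response, error) {
    // Every request through apiClient inherits the same overall deadline.
    return apiClient.Get(url)
}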
Graceful Degradation
type Service struct {
    cache Cache
    db    Database
}

func (s *Service) GetUser(id string) (*User, error) {
    // Try cache first
    if user, err := s.cache.Get(id); err == nil {
        return user, nil
    }

    // Fallback to database
    user, err := s.db.GetUser(id)
    if err != nil {
        // Return degraded response instead of surfacing the error
        return &User{
            ID:   id,
            Name: "User data temporarily unavailable",
        }, nil
    }

    // Update cache
    s.cache.Set(id, user)
    return user, nil
}
Observability for Chaos
Effective chaos engineering requires comprehensive observability:
import (
    "fmt"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint", "status"},
    )
    errorCounter = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_errors_total",
            Help: "Total HTTP errors",
        },
        []string{"method", "endpoint", "type"},
    )
)
func instrumentedHandler(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}

        next.ServeHTTP(wrapped, r)

        duration := time.Since(start).Seconds()
        requestDuration.WithLabelValues(
            r.Method,
            r.URL.Path,
            fmt.Sprintf("%d", wrapped.statusCode),
        ).Observe(duration)

        if wrapped.statusCode >= 500 {
            errorCounter.WithLabelValues(
                r.Method,
                r.URL.Path,
                "server_error",
            ).Inc()
        }
    })
}
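The handler wraps the ResponseWriter in a small responseWriter type to record the status code; that type isn't shown above, so here is a minimal version:
// responseWriter records the status code written by downstream handlers.
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}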
Automated Continuous Chaos
Schedule regular chaos experiments:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-experiment
  namespace: litmus
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: litmus-admin
          containers:
            - name: chaos
              image: litmuschaos/litmus-checker:latest
              command: ["/bin/bash"]
              args:
                - -c
                - |
                  kubectl apply -f /experiments/pod-delete.yaml
                  sleep 300
                  kubectl delete chaosengine pod-delete-chaos
          restartPolicy: OnFailure
Progressive Blast Radius
Start small and expand:
# Week 1: Single non-production pod
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TARGET_PODS
            value: "1"

# Week 2: Multiple non-production pods
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TARGET_PODS
            value: "3"

# Week 3: Production with low traffic
# Week 4: Full production deployment
Safety Controls
Implement guardrails:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: safe-chaos
spec:
  appinfo:
    appns: production
    applabel: app=myapp
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  # Bound cleanup time for the chaos resources
  terminationGracePeriodSeconds: 30
  experiments:
    - name: pod-delete
      spec:
        # Continuous probe: the experiment fails if the health check stops returning 200
        probe:
          - name: check-service-health
            type: httpProbe
            httpProbe/inputs:
              url: https://myapp.example.com/health
              expectedResponseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
Metrics and Success Criteria
Define what success looks like:
metrics:
  - name: availability
    threshold: ">= 99.9%"
    query: |
      sum(rate(http_requests_total{status!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
  - name: latency_p99
    threshold: "<= 500ms"
    query: |
      histogram_quantile(0.99,
        rate(http_request_duration_seconds_bucket[5m]))
  - name: error_rate
    threshold: "<= 0.1%"
    query: |
      sum(rate(http_errors_total[5m])) /
      sum(rate(http_requests_total[5m]))
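Something has to run these queries during an experiment and compare the results against the thresholds. The sketch below goes straight at the Prometheus HTTP API; the promURL parameter and the 0.999 availability target are assumptions carried over from the table above, and in practice you would likely reuse existing SLO or alerting tooling instead:
import (
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "strconv"
)

// queryPrometheus runs an instant query and returns the first sample's value.
func queryPrometheus(ctx context.Context, promURL, query string) (float64, error) {
    endpoint := promURL + "/api/v1/query?query=" + url.QueryEscape(query)
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
    if err != nil {
        return 0, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var body struct {
        Data struct {
            Result []struct {
                Value [2]interface{} `json:"value"` // [timestamp, "value"]
            } `json:"result"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        return 0, err
    }
    if len(body.Data.Result) == 0 {
        return 0, fmt.Errorf("no samples for query %q", query)
    }
    s, ok := body.Data.Result[0].Value[1].(string)
    if !ok {
        return 0, fmt.Errorf("unexpected sample format for query %q", query)
    }
    return strconv.ParseFloat(s, 64)
}

// checkAvailability fails the experiment run if availability drops below 99.9%.
func checkAvailability(ctx context.Context, promURL string) error {
    const query = `sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
    v, err := queryPrometheus(ctx, promURL, query)
    if err != nil {
        return err
    }
    if v < 0.999 {
        return fmt.Errorf("availability %.4f is below the steady-state threshold of 0.999", v)
    }
    return nil
}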
Building a Chaos Engineering Culture
Technical implementation is only part of the challenge:
- Get buy-in: Demonstrate value with small, safe experiments
- Start outside production: Build confidence in staging first
- Run game days: Make chaos engineering a team activity
- Document everything: Maintain runbooks and post-mortems
- Celebrate learnings: Focus on improvements, not blame
- Make it continuous: Automate experiments over time
Common Pitfalls
Avoid these mistakes:
- Too much too soon: Start with gentle experiments
- No hypothesis: Define expected behavior first
- Poor observability: You can’t learn if you can’t measure
- Ignoring findings: Act on discovered weaknesses
- Running once: Make chaos continuous
- Production YOLO: Build confidence in lower environments first
Conclusion
Chaos engineering transforms how we think about system reliability. Instead of hoping systems will handle failures gracefully, we verify they actually do through controlled experiments.
Start small:
- Define steady-state metrics
- Run manual game days
- Automate simple experiments (pod deletion)
- Gradually expand scope and complexity
- Make chaos continuous
The goal isn’t to cause outages—it’s to discover weaknesses before they cause outages. Every chaos experiment is an opportunity to build a more resilient system and develop confidence in its behavior under stress.
Remember: You don’t find out if your parachute works by jumping out of a plane. You test it on the ground first. Chaos engineering is testing your parachutes.