Microservices fail. Networks are unreliable. Dependencies go down. If you build a distributed system assuming everything works perfectly, you’re building a system that will fail catastrophically.

I’ve operated microservices at scale long enough to know that resilience isn’t optional—it’s the foundation. Today, I want to share the essential patterns I use to build systems that gracefully handle failure rather than cascading into outages.

The Cascading Failure Problem

Consider this scenario: Your payment service depends on a fraud detection API. The fraud API starts responding slowly due to a database issue. Your payment service waits for responses, exhausting its connection pool. Requests queue up. Your order service, which calls the payment service, also starts timing out. Soon your entire checkout flow is down, even though only one downstream dependency has an issue.

This is a cascading failure, and it’s the most common way distributed systems collapse.

Timeout Pattern

The simplest resilience pattern: don’t wait forever.

func callPaymentService(ctx context.Context, payment *Payment) error {
    // Create a context with timeout
    ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, "POST",
        "http://payment-service/charge", serializePayment(payment))
    if err != nil {
        return err
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        // This includes timeout errors
        if errors.Is(err, context.DeadlineExceeded) {
            return fmt.Errorf("payment service timeout: %w", err)
        }
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("payment failed: status %d", resp.StatusCode)
    }

    return nil
}

Key decisions:

  • Timeout value: Should be longer than the dependency’s typical latency, but short enough to fail fast. For most services, I use 3-5 seconds for synchronous calls.
  • Different timeouts for different operations: Database queries might time out at 1 second, external APIs at 10 seconds.
  • Propagate timeouts: If your service has a 5-second timeout from clients, use shorter timeouts (e.g., 2 seconds) for your dependencies to leave time for your own processing. A budgeting sketch follows this list.
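
To make the propagation point concrete, here’s a minimal sketch of timeout budgeting. It assumes the caller already attached a deadline to the incoming context; the reserve parameter and the dependencyTimeout name are mine, not from any library:

// Derive a dependency timeout from the caller's remaining deadline,
// reserving time for our own processing. Falls back to a default when
// the incoming context carries no deadline. If the budget is already
// exhausted, the derived context expires immediately, which fails fast.
func dependencyTimeout(ctx context.Context, defaultTimeout, reserve time.Duration) time.Duration {
    if deadline, ok := ctx.Deadline(); ok {
        if remaining := time.Until(deadline) - reserve; remaining < defaultTimeout {
            return remaining
        }
    }
    return defaultTimeout
}

// Usage inside a handler that received a 5-second client deadline:
// ctx, cancel := context.WithTimeout(ctx, dependencyTimeout(ctx, 3*time.Second, 500*time.Millisecond))
// defer cancel()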

Retry Pattern

Some failures are transient. A network blip. A temporarily overloaded server. Retrying can succeed where the first attempt failed.

type RetryConfig struct {
    MaxAttempts int
    BaseDelay   time.Duration
    MaxDelay    time.Duration
    Multiplier  float64
}

func RetryWithBackoff(ctx context.Context, config RetryConfig, fn func() error) error {
    var lastErr error

    for attempt := 1; attempt <= config.MaxAttempts; attempt++ {
        err := fn()
        if err == nil {
            return nil // Success
        }

        lastErr = err

        // Don't retry on certain errors
        if !isRetriable(err) {
            return err
        }

        // Don't retry if we're out of attempts
        if attempt == config.MaxAttempts {
            break
        }

        // Calculate backoff delay with exponential increase
        delay := config.BaseDelay * time.Duration(math.Pow(config.Multiplier, float64(attempt-1)))
        if delay > config.MaxDelay {
            delay = config.MaxDelay
        }

        // Add jitter to prevent thundering herd
        jitter := time.Duration(rand.Float64() * float64(delay) * 0.1)
        delay += jitter

        log.Printf("Attempt %d failed: %v. Retrying in %v", attempt, err, delay)

        // Wait before retrying
        select {
        case <-time.After(delay):
            // Continue to next attempt
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    return fmt.Errorf("max retries exceeded: %w", lastErr)
}

func isRetriable(err error) bool {
    // Don't retry client errors (4xx); errors.As also handles wrapped errors
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        if httpErr.StatusCode >= 400 && httpErr.StatusCode < 500 {
            return false
        }
    }

    // Retry network errors, timeouts, 5xx errors
    return true
}

// Usage
err := RetryWithBackoff(ctx, RetryConfig{
    MaxAttempts: 3,
    BaseDelay:   100 * time.Millisecond,
    MaxDelay:    2 * time.Second,
    Multiplier:  2.0,
}, func() error {
    return callPaymentService(ctx, payment)
})

Important considerations:

  • Idempotency: Only retry operations that are safe to repeat. Use idempotency keys for operations that aren’t naturally idempotent (see the sketch after this list).
  • Exponential backoff: Increase delay between retries to avoid hammering a struggling service.
  • Jitter: Add randomness to retry timing to prevent synchronized retries from all clients.
  • Maximum attempts: Don’t retry forever. Three attempts is usually sufficient.
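
Here’s a rough sketch of the idempotency-key idea layered on the retry helper above; the Idempotency-Key header name and the uuid package (github.com/google/uuid) are assumptions, so use whatever your payment provider actually expects:

// Generate one idempotency key outside the retry loop and send it with
// every attempt so the payment service can de-duplicate repeated charges.
func chargeWithIdempotency(ctx context.Context, payment *Payment) error {
    key := uuid.NewString() // same key for all attempts

    return RetryWithBackoff(ctx, RetryConfig{
        MaxAttempts: 3,
        BaseDelay:   100 * time.Millisecond,
        MaxDelay:    2 * time.Second,
        Multiplier:  2.0,
    }, func() error {
        req, err := http.NewRequestWithContext(ctx, "POST",
            "http://payment-service/charge", serializePayment(payment))
        if err != nil {
            return err
        }
        req.Header.Set("Idempotency-Key", key)

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("charge failed: status %d", resp.StatusCode)
        }
        return nil
    })
}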

Circuit Breaker Pattern

When a dependency is failing, stop calling it. Give it time to recover instead of overwhelming it with requests.

type CircuitState int

const (
    StateClosed CircuitState = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    maxFailures  int
    resetTimeout time.Duration

    state        CircuitState
    failures     int
    lastFailTime time.Time
    mu           sync.RWMutex
}

func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        maxFailures:  maxFailures,
        resetTimeout: resetTimeout,
        state:        StateClosed,
    }
}

var ErrCircuitOpen = errors.New("circuit breaker is open")

func (cb *CircuitBreaker) Call(fn func() error) error {
    if !cb.canAttempt() {
        return ErrCircuitOpen
    }

    err := fn()

    if err != nil {
        cb.recordFailure()
        return err
    }

    cb.recordSuccess()
    return nil
}

func (cb *CircuitBreaker) canAttempt() bool {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    switch cb.state {
    case StateClosed:
        return true
    case StateOpen:
        // Check if enough time has passed to try again
        if time.Since(cb.lastFailTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
            log.Println("Circuit breaker: open -> half-open")
            return true
        }
        return false
    default:
        // StateHalfOpen - allow a trial request through
        return true
    }
}

func (cb *CircuitBreaker) recordFailure() {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    cb.failures++
    cb.lastFailTime = time.Now()

    if cb.state == StateHalfOpen {
        // Failed during half-open, go back to open
        cb.state = StateOpen
        log.Println("Circuit breaker: half-open -> open")
        return
    }

    if cb.failures >= cb.maxFailures {
        cb.state = StateOpen
        log.Printf("Circuit breaker opened after %d failures", cb.failures)
    }
}

func (cb *CircuitBreaker) recordSuccess() {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    if cb.state == StateHalfOpen {
        // Success during half-open, close the circuit
        cb.state = StateClosed
        cb.failures = 0
        log.Println("Circuit breaker: half-open -> closed")
        return
    }

    // Reset failure count on success
    cb.failures = 0
}

// Usage
paymentBreaker := NewCircuitBreaker(5, 30*time.Second)

err := paymentBreaker.Call(func() error {
    return callPaymentService(ctx, payment)
})

if err != nil {
    if errors.Is(err, ErrCircuitOpen) {
        // Handle degraded mode - maybe use cached data or return an error
        return errors.New("payment service unavailable")
    }
    return err
}

Circuit states:

  • Closed: Normal operation, requests flow through
  • Open: Too many failures, block all requests for a timeout period
  • Half-Open: After timeout, allow one request to test if service recovered

Bulkhead Pattern

Isolate resources so that failure in one area doesn’t exhaust resources for others.

type BulkheadPool struct {
    pools map[string]*ResourcePool
    mu    sync.RWMutex
}

type ResourcePool struct {
    semaphore chan struct{}
    name      string
}

func NewBulkheadPool() *BulkheadPool {
    return &BulkheadPool{
        pools: make(map[string]*ResourcePool),
    }
}

func (bp *BulkheadPool) AddPool(name string, maxConcurrent int) {
    bp.mu.Lock()
    defer bp.mu.Unlock()

    bp.pools[name] = &ResourcePool{
        semaphore: make(chan struct{}, maxConcurrent),
        name:      name,
    }
}

func (bp *BulkheadPool) Execute(poolName string, fn func() error) error {
    bp.mu.RLock()
    pool, ok := bp.pools[poolName]
    bp.mu.RUnlock()

    if !ok {
        return fmt.Errorf("pool %s not found", poolName)
    }

    // Try to acquire a slot
    select {
    case pool.semaphore <- struct{}{}:
        // Got a slot
        defer func() { <-pool.semaphore }()
        return fn()
    default:
        // Pool is full
        return fmt.Errorf("pool %s is at capacity", poolName)
    }
}

// Usage - separate pools for different dependencies
bulkhead := NewBulkheadPool()
bulkhead.AddPool("payment-service", 10)   // Max 10 concurrent payment calls
bulkhead.AddPool("inventory-service", 20) // Max 20 concurrent inventory calls

err := bulkhead.Execute("payment-service", func() error {
    return callPaymentService(ctx, payment)
})

This prevents a slow payment service from exhausting all connections, leaving resources for inventory and other services.
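
If you prefer a library over the hand-rolled channel semaphore, golang.org/x/sync/semaphore gives you equivalent behavior. An alternative sketch:

// Equivalent bulkhead using golang.org/x/sync/semaphore. TryAcquire rejects
// immediately when the pool is at capacity, mirroring the channel version.
type SemaphoreBulkhead struct {
    name string
    sem  *semaphore.Weighted
}

func NewSemaphoreBulkhead(name string, maxConcurrent int64) *SemaphoreBulkhead {
    return &SemaphoreBulkhead{name: name, sem: semaphore.NewWeighted(maxConcurrent)}
}

func (b *SemaphoreBulkhead) Execute(fn func() error) error {
    if !b.sem.TryAcquire(1) {
        return fmt.Errorf("pool %s is at capacity", b.name)
    }
    defer b.sem.Release(1)
    return fn()
}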

Fallback Pattern

When a service fails, provide a degraded but functional response.

type RecommendationService struct {
    mlService    *MLRecommendationService
    cacheService *CacheService
    defaultRecs  []string
}

func (s *RecommendationService) GetRecommendations(ctx context.Context, userID string) ([]string, error) {
    // Try ML-based recommendations
    recs, err := s.mlService.GetRecommendations(ctx, userID)
    if err == nil {
        return recs, nil
    }

    log.Printf("ML recommendations failed: %v. Trying cache.", err)

    // Fallback 1: Try cached recommendations
    cached, err := s.cacheService.Get(ctx, "recs:"+userID)
    if err == nil {
        return cached, nil
    }

    log.Printf("Cache miss: %v. Using default recommendations.", err)

    // Fallback 2: Return popular items
    return s.defaultRecs, nil
}

Fallback strategies:

  • Cached data: Return stale but valid data
  • Default values: Return sensible defaults (popular items, empty list, etc.)
  • Simplified logic: Use a simpler algorithm that doesn’t depend on the failing service
  • Degraded functionality: Disable non-critical features
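
These strategies can also be factored into a small reusable helper so each caller doesn’t re-implement the chain. A minimal generic sketch (Go 1.18+; the WithFallback name is mine, not from the service above):

// Try a primary function and fall back to a secondary one on error.
// Chain calls for multiple fallback levels (ML -> cache -> defaults).
func WithFallback[T any](primary, fallback func() (T, error)) (T, error) {
    result, err := primary()
    if err == nil {
        return result, nil
    }
    log.Printf("primary failed: %v. Using fallback.", err)
    return fallback()
}

The recommendation flow above then becomes a chain of two WithFallback calls: ML over cache, and cache over the default list.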

Rate Limiting

Protect your service from being overwhelmed.

type RateLimiter struct {
    requests map[string]*TokenBucket
    mu       sync.RWMutex
}

type TokenBucket struct {
    tokens      float64
    maxTokens   float64
    refillRate  float64 // tokens per second
    lastRefill  time.Time
    mu          sync.Mutex
}

func NewTokenBucket(maxTokens float64, refillRate float64) *TokenBucket {
    return &TokenBucket{
        tokens:     maxTokens,
        maxTokens:  maxTokens,
        refillRate: refillRate,
        lastRefill: time.Now(),
    }
}

func (tb *TokenBucket) Allow() bool {
    tb.mu.Lock()
    defer tb.mu.Unlock()

    // Refill tokens based on time elapsed
    now := time.Now()
    elapsed := now.Sub(tb.lastRefill).Seconds()
    tb.tokens = math.Min(tb.maxTokens, tb.tokens+elapsed*tb.refillRate)
    tb.lastRefill = now

    // Check if we have a token available
    if tb.tokens >= 1.0 {
        tb.tokens -= 1.0
        return true
    }

    return false
}

func NewRateLimiter() *RateLimiter {
    return &RateLimiter{requests: make(map[string]*TokenBucket)}
}

func (rl *RateLimiter) AllowRequest(clientID string) bool {
    rl.mu.RLock()
    bucket, ok := rl.requests[clientID]
    rl.mu.RUnlock()

    if !ok {
        rl.mu.Lock()
        // Re-check under the write lock in case another goroutine just created it
        if bucket, ok = rl.requests[clientID]; !ok {
            bucket = NewTokenBucket(100, 10) // 100 tokens, refill 10/sec
            rl.requests[clientID] = bucket
        }
        rl.mu.Unlock()
    }

    return bucket.Allow()
}

// Middleware
func RateLimitMiddleware(limiter *RateLimiter) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            clientID := getClientID(r) // Could be IP, user ID, API key

            if !limiter.AllowRequest(clientID) {
                http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
                return
            }

            next.ServeHTTP(w, r)
        })
    }
}
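
If you’d rather not maintain the bucket yourself, golang.org/x/time/rate implements the same token-bucket semantics. A per-client sketch with the same 100-burst, 10-per-second configuration:

// Per-client limiters backed by golang.org/x/time/rate.
type XRateLimiter struct {
    limiters map[string]*rate.Limiter
    mu       sync.Mutex
}

func NewXRateLimiter() *XRateLimiter {
    return &XRateLimiter{limiters: make(map[string]*rate.Limiter)}
}

func (rl *XRateLimiter) AllowRequest(clientID string) bool {
    rl.mu.Lock()
    lim, ok := rl.limiters[clientID]
    if !ok {
        lim = rate.NewLimiter(10, 100) // 10 tokens/sec, burst of 100
        rl.limiters[clientID] = lim
    }
    rl.mu.Unlock()
    return lim.Allow()
}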

Combining Patterns

Real resilience comes from combining these patterns:

type ResilientClient struct {
    httpClient      *http.Client
    circuitBreaker  *CircuitBreaker
    bulkhead        *BulkheadPool
    retryConfig     RetryConfig
}

func (c *ResilientClient) Call(ctx context.Context, url string) ([]byte, error) {
    var result []byte

    // Bulkhead: Limit concurrent requests
    err := c.bulkhead.Execute("external-api", func() error {
        // Circuit breaker: Stop calling if service is down
        return c.circuitBreaker.Call(func() error {
            // Retry with backoff: Handle transient failures
            return RetryWithBackoff(ctx, c.retryConfig, func() error {
                // Timeout: Don't wait forever
                ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
                defer cancel()

                req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
                if err != nil {
                    return err
                }

                resp, err := c.httpClient.Do(req)
                if err != nil {
                    return err
                }
                defer resp.Body.Close()

                if resp.StatusCode != http.StatusOK {
                    return fmt.Errorf("status: %d", resp.StatusCode)
                }

                result, err = io.ReadAll(resp.Body)
                return err
            })
        })
    })

    return result, err
}
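
Wiring it together might look like this; the pool name and tuning values are illustrative, and the snippet assumes it lives in the same package as ResilientClient:

// Usage
bulkhead := NewBulkheadPool()
bulkhead.AddPool("external-api", 10)

client := &ResilientClient{
    httpClient:     &http.Client{},
    circuitBreaker: NewCircuitBreaker(5, 30*time.Second),
    bulkhead:       bulkhead,
    retryConfig: RetryConfig{
        MaxAttempts: 3,
        BaseDelay:   100 * time.Millisecond,
        MaxDelay:    2 * time.Second,
        Multiplier:  2.0,
    },
}

body, err := client.Call(ctx, "http://inventory-service/items")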

This client has multiple layers of protection:

  1. Timeout prevents hanging forever
  2. Retry handles transient failures
  3. Circuit breaker stops calling a down service
  4. Bulkhead prevents resource exhaustion

Health Checks and Load Balancing

Remove unhealthy instances from rotation:

type HealthChecker struct {
    checkInterval time.Duration
    timeout       time.Duration
}

func (hc *HealthChecker) CheckHealth(endpoint string) bool {
    ctx, cancel := context.WithTimeout(context.Background(), hc.timeout)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, "GET", endpoint+"/health", nil)
    if err != nil {
        return false
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false
    }
    defer resp.Body.Close()

    return resp.StatusCode == http.StatusOK
}
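
The checker assumes every backend exposes a /health endpoint. On the server side, a minimal handler might look like this; the db handle and the choice of what to ping are assumptions, so adapt it to your own critical dependencies:

// Health handler: report unhealthy if a critical dependency
// (here, a hypothetical *sql.DB named db) cannot be reached quickly.
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 1*time.Second)
    defer cancel()

    if err := db.PingContext(ctx); err != nil {
        http.Error(w, "unhealthy: "+err.Error(), http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})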

type LoadBalancer struct {
    backends      []string
    healthy       map[string]bool
    mu            sync.RWMutex
    healthChecker *HealthChecker
    current       int
}

func (lb *LoadBalancer) StartHealthChecks() {
    ticker := time.NewTicker(lb.healthChecker.checkInterval)
    go func() {
        for range ticker.C {
            for _, backend := range lb.backends {
                healthy := lb.healthChecker.CheckHealth(backend)

                lb.mu.Lock()
                lb.healthy[backend] = healthy
                lb.mu.Unlock()

                if !healthy {
                    log.Printf("Backend %s is unhealthy", backend)
                }
            }
        }
    }()
}

func (lb *LoadBalancer) GetHealthyBackend() (string, error) {
    // Use the write lock: lb.current is mutated during round-robin selection
    lb.mu.Lock()
    defer lb.mu.Unlock()

    // Round-robin through healthy backends
    attempts := 0
    for attempts < len(lb.backends) {
        backend := lb.backends[lb.current%len(lb.backends)]
        lb.current++

        if lb.healthy[backend] {
            return backend, nil
        }

        attempts++
    }

    return "", errors.New("no healthy backends available")
}

Monitoring Resilience

Track resilience metrics:

type ResilienceMetrics struct {
    circuitBreakerState *prometheus.GaugeVec
    retryCount          *prometheus.CounterVec
    timeoutCount        *prometheus.CounterVec
    fallbackCount       *prometheus.CounterVec
}

func (m *ResilienceMetrics) RecordCircuitState(service string, state CircuitState) {
    m.circuitBreakerState.WithLabelValues(service).Set(float64(state))
}

func (m *ResilienceMetrics) RecordRetry(service string) {
    m.retryCount.WithLabelValues(service).Inc()
}

func (m *ResilienceMetrics) RecordTimeout(service string) {
    m.timeoutCount.WithLabelValues(service).Inc()
}

func (m *ResilienceMetrics) RecordFallback(service string) {
    m.fallbackCount.WithLabelValues(service).Inc()
}
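
Here’s one way the vectors might be constructed and registered; the metric names are illustrative, and the *Vec types are what make WithLabelValues work:

func NewResilienceMetrics(reg prometheus.Registerer) *ResilienceMetrics {
    m := &ResilienceMetrics{
        circuitBreakerState: prometheus.NewGaugeVec(prometheus.GaugeOpts{
            Name: "circuit_breaker_state",
            Help: "0=closed, 1=open, 2=half-open",
        }, []string{"service"}),
        retryCount: prometheus.NewCounterVec(prometheus.CounterOpts{
            Name: "retry_total",
            Help: "Number of retried calls",
        }, []string{"service"}),
        timeoutCount: prometheus.NewCounterVec(prometheus.CounterOpts{
            Name: "timeout_total",
            Help: "Number of timed-out calls",
        }, []string{"service"}),
        fallbackCount: prometheus.NewCounterVec(prometheus.CounterOpts{
            Name: "fallback_total",
            Help: "Number of fallback responses served",
        }, []string{"service"}),
    }
    reg.MustRegister(m.circuitBreakerState, m.retryCount, m.timeoutCount, m.fallbackCount)
    return m
}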

Alert on:

  • Circuit breakers opening (indicates service issues)
  • High retry rates (indicates instability)
  • Frequent timeouts (may need timeout tuning)
  • Fallback usage (degraded functionality)

Testing Resilience

Chaos engineering: intentionally break things to validate resilience.

// Chaos middleware that randomly fails requests
func ChaosMiddleware(failureRate float64) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if rand.Float64() < failureRate {
                // Simulate different failure modes
                mode := rand.Intn(3)
                switch mode {
                case 0:
                    // Hang long enough to trigger client-side timeouts
                    time.Sleep(10 * time.Second)
                case 1:
                    // Error
                    http.Error(w, "Chaos: simulated error", http.StatusInternalServerError)
                    return
                case 2:
                    // Slow response
                    time.Sleep(2 * time.Second)
                }
            }

            next.ServeHTTP(w, r)
        })
    }
}
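
Wire the middleware in behind an environment check so chaos only runs where you intend it to; the environment variable and buildRouter helper are placeholders:

// Usage: inject 5% chaos in staging only
handler := buildRouter() // your normal http.Handler
if os.Getenv("APP_ENV") == "staging" {
    handler = ChaosMiddleware(0.05)(handler)
}
log.Fatal(http.ListenAndServe(":8080", handler))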

Best Practices

Design for failure: Assume every call can fail and plan accordingly.

Fail fast: Use short timeouts to detect issues quickly.

Degrade gracefully: Provide reduced functionality rather than complete failure.

Monitor everything: Track resilience patterns to understand system behavior.

Test in production: Use canary deployments and gradual rollouts.

Document behavior: Make it clear how your service behaves when dependencies fail.

Looking Forward

Resilience patterns are becoming infrastructure primitives. Service meshes provide circuit breakers and retries automatically. Kubernetes offers health checks and rolling deployments. But understanding the fundamentals remains critical for building robust systems.

The patterns I’ve shared—timeouts, retries, circuit breakers, bulkheads, and fallbacks—are battle-tested. Implement them thoughtfully, tune them based on your system’s characteristics, and monitor their effectiveness.

When your payment API has a bad day, your resilient architecture will keep the rest of your system running.