Microservices fail. Networks are unreliable. Dependencies go down. If you build a distributed system assuming everything works perfectly, you’re building a system that will fail catastrophically.
I’ve operated microservices at scale long enough to know that resilience isn’t optional—it’s the foundation. Today, I want to share the essential patterns I use to build systems that gracefully handle failure rather than cascading into outages.
The Cascading Failure Problem
Consider this scenario: Your payment service depends on a fraud detection API. The fraud API starts responding slowly due to a database issue. Your payment service waits for responses, exhausting its connection pool. Requests queue up. Your order service, which calls the payment service, also starts timing out. Soon your entire checkout flow is down, even though only one downstream dependency has an issue.
This is a cascading failure, and it’s the most common way distributed systems collapse.
Timeout Pattern
The simplest resilience pattern: don’t wait forever.
func callPaymentService(ctx context.Context, payment *Payment) error {
// Create a context with timeout
ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
req, err := http.NewRequestWithContext(ctx, "POST",
"http://payment-service/charge", serializePayment(payment))
if err != nil {
return err
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
// This includes timeout errors
if errors.Is(err, context.DeadlineExceeded) {
return fmt.Errorf("payment service timeout: %w", err)
}
return err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return fmt.Errorf("payment failed: status %d", resp.StatusCode)
}
return nil
}
Key decisions:
- Timeout value: Should be longer than typical latency, but short enough to fail fast. For most services, I use 3-5 seconds for synchronous calls.
- Different timeouts for different operations: Database queries might timeout at 1 second, external APIs at 10 seconds.
- Propagate timeouts: If your service has a 5-second timeout from clients, use shorter timeouts (e.g., 2 seconds) for your dependencies to leave time for your own processing (see the sketch after this list).
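One way to propagate a deadline is to derive each dependency's timeout from whatever time the caller has left, minus a reserve for your own work. A minimal sketch, assuming the incoming context.Context carries the client's deadline; dependencyTimeout, defaultTimeout, and reserve are illustrative names, not part of the code above:

// dependencyTimeout derives a per-call timeout from the caller's remaining deadline,
// capped at defaultTimeout, reserving some time for our own processing.
func dependencyTimeout(ctx context.Context, defaultTimeout, reserve time.Duration) time.Duration {
	deadline, ok := ctx.Deadline()
	if !ok {
		return defaultTimeout // caller set no deadline; fall back to the default
	}
	remaining := time.Until(deadline) - reserve
	if remaining <= 0 {
		return time.Millisecond // effectively fail fast; the caller is about to give up
	}
	if remaining > defaultTimeout {
		return defaultTimeout
	}
	return remaining
}

You would then call context.WithTimeout(ctx, dependencyTimeout(ctx, 3*time.Second, 500*time.Millisecond)) instead of hard-coding a value.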
Retry Pattern
Some failures are transient. A network blip. A temporarily overloaded server. Retrying can succeed where the first attempt failed.
type RetryConfig struct {
MaxAttempts int
BaseDelay time.Duration
MaxDelay time.Duration
Multiplier float64
}
func RetryWithBackoff(ctx context.Context, config RetryConfig, fn func() error) error {
var lastErr error
for attempt := 1; attempt <= config.MaxAttempts; attempt++ {
err := fn()
if err == nil {
return nil // Success
}
lastErr = err
// Don't retry on certain errors
if !isRetriable(err) {
return err
}
// Don't retry if we're out of attempts
if attempt == config.MaxAttempts {
break
}
// Calculate backoff delay with exponential increase
delay := config.BaseDelay * time.Duration(math.Pow(config.Multiplier, float64(attempt-1)))
if delay > config.MaxDelay {
delay = config.MaxDelay
}
// Add jitter to prevent thundering herd
jitter := time.Duration(rand.Float64() * float64(delay) * 0.1)
delay += jitter
log.Printf("Attempt %d failed: %v. Retrying in %v", attempt, err, delay)
// Wait before retrying
select {
case <-time.After(delay):
// Continue to next attempt
case <-ctx.Done():
return ctx.Err()
}
}
return fmt.Errorf("max retries exceeded: %w", lastErr)
}
func isRetriable(err error) bool {
	// Don't retry client errors (4xx); errors.As also matches wrapped errors
	var httpErr *HTTPError
	if errors.As(err, &httpErr) && httpErr.StatusCode >= 400 && httpErr.StatusCode < 500 {
		return false
	}
	// Retry network errors, timeouts, and 5xx errors
	return true
}
// Usage
err := RetryWithBackoff(ctx, RetryConfig{
MaxAttempts: 3,
BaseDelay: 100 * time.Millisecond,
MaxDelay: 2 * time.Second,
Multiplier: 2.0,
}, func() error {
return callPaymentService(ctx, payment)
})
Important considerations:
- Idempotency: Only retry operations that are safe to repeat. Use idempotency keys for operations that aren’t naturally idempotent (see the sketch after this list).
- Exponential backoff: Increase delay between retries to avoid hammering a struggling service.
- Jitter: Add randomness to retry timing to prevent synchronized retries from all clients.
- Maximum attempts: Don’t retry forever. Three attempts is usually sufficient.
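For calls that aren’t naturally idempotent, a common approach is a client-generated idempotency key that stays constant across retries so the server can deduplicate repeated charges. A minimal sketch, assuming the payment service honors an Idempotency-Key header and using github.com/google/uuid to generate the key (both are assumptions, not part of the code above):

func chargeWithRetries(ctx context.Context, cfg RetryConfig, payment *Payment) error {
	// Generate the key once; every retry of this logical operation reuses it
	idempotencyKey := uuid.NewString()
	return RetryWithBackoff(ctx, cfg, func() error {
		req, err := http.NewRequestWithContext(ctx, "POST",
			"http://payment-service/charge", serializePayment(payment))
		if err != nil {
			return err
		}
		req.Header.Set("Idempotency-Key", idempotencyKey)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("payment failed: status %d", resp.StatusCode)
		}
		return nil
	})
}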
Circuit Breaker Pattern
When a dependency is failing, stop calling it. Give it time to recover instead of overwhelming it with requests.
type CircuitState int
const (
StateClosed CircuitState = iota
StateOpen
StateHalfOpen
)
type CircuitBreaker struct {
maxFailures int
resetTimeout time.Duration
state CircuitState
failures int
lastFailTime time.Time
mu sync.RWMutex
}
func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
return &CircuitBreaker{
maxFailures: maxFailures,
resetTimeout: resetTimeout,
state: StateClosed,
}
}
// ErrCircuitOpen lets callers detect a short-circuited call without comparing error strings
var ErrCircuitOpen = errors.New("circuit breaker is open")

func (cb *CircuitBreaker) Call(fn func() error) error {
	if !cb.canAttempt() {
		return ErrCircuitOpen
	}
err := fn()
if err != nil {
cb.recordFailure()
return err
}
cb.recordSuccess()
return nil
}
func (cb *CircuitBreaker) canAttempt() bool {
	// Take the write lock because an open circuit may transition to half-open here
	cb.mu.Lock()
	defer cb.mu.Unlock()
	switch cb.state {
	case StateClosed:
		return true
	case StateOpen:
		// Check if enough time has passed to try again
		if time.Since(cb.lastFailTime) > cb.resetTimeout {
			cb.state = StateHalfOpen
			return true
		}
		return false
	default:
		// StateHalfOpen - allow a trial request (this simple version doesn't limit it to exactly one)
		return true
	}
}
func (cb *CircuitBreaker) recordFailure() {
cb.mu.Lock()
defer cb.mu.Unlock()
cb.failures++
cb.lastFailTime = time.Now()
if cb.state == StateHalfOpen {
// Failed during half-open, go back to open
cb.state = StateOpen
log.Println("Circuit breaker: half-open -> open")
return
}
if cb.failures >= cb.maxFailures {
cb.state = StateOpen
log.Printf("Circuit breaker opened after %d failures", cb.failures)
}
}
func (cb *CircuitBreaker) recordSuccess() {
cb.mu.Lock()
defer cb.mu.Unlock()
if cb.state == StateHalfOpen {
// Success during half-open, close the circuit
cb.state = StateClosed
cb.failures = 0
log.Println("Circuit breaker: half-open -> closed")
return
}
// Reset failure count on success
cb.failures = 0
}
// Usage
paymentBreaker := NewCircuitBreaker(5, 30*time.Second)
err := paymentBreaker.Call(func() error {
return callPaymentService(ctx, payment)
})
if err != nil {
	if errors.Is(err, ErrCircuitOpen) {
		// Handle degraded mode - maybe use cached data or return an error
		return errors.New("payment service unavailable")
	}
return err
}
Circuit states:
- Closed: Normal operation, requests flow through
- Open: Too many failures, block all requests for a timeout period
- Half-Open: After timeout, allow one request to test if service recovered
Bulkhead Pattern
Isolate resources so that failure in one area doesn’t exhaust resources for others.
type BulkheadPool struct {
pools map[string]*ResourcePool
mu sync.RWMutex
}
type ResourcePool struct {
semaphore chan struct{}
name string
}
func NewBulkheadPool() *BulkheadPool {
return &BulkheadPool{
pools: make(map[string]*ResourcePool),
}
}
func (bp *BulkheadPool) AddPool(name string, maxConcurrent int) {
bp.mu.Lock()
defer bp.mu.Unlock()
bp.pools[name] = &ResourcePool{
semaphore: make(chan struct{}, maxConcurrent),
name: name,
}
}
func (bp *BulkheadPool) Execute(poolName string, fn func() error) error {
bp.mu.RLock()
pool, ok := bp.pools[poolName]
bp.mu.RUnlock()
if !ok {
return fmt.Errorf("pool %s not found", poolName)
}
// Try to acquire a slot
select {
case pool.semaphore <- struct{}{}:
// Got a slot
defer func() { <-pool.semaphore }()
return fn()
default:
// Pool is full
return fmt.Errorf("pool %s is at capacity", poolName)
}
}
// Usage - separate pools for different dependencies
bulkhead := NewBulkheadPool()
bulkhead.AddPool("payment-service", 10) // Max 10 concurrent payment calls
bulkhead.AddPool("inventory-service", 20) // Max 20 concurrent inventory calls
err := bulkhead.Execute("payment-service", func() error {
return callPaymentService(ctx, payment)
})
This prevents a slow payment service from exhausting all connections, leaving resources for inventory and other services.
Fallback Pattern
When a service fails, provide a degraded but functional response.
type RecommendationService struct {
mlService *MLRecommendationService
cacheService *CacheService
defaultRecs []string
}
func (s *RecommendationService) GetRecommendations(ctx context.Context, userID string) ([]string, error) {
// Try ML-based recommendations
recs, err := s.mlService.GetRecommendations(ctx, userID)
if err == nil {
return recs, nil
}
log.Printf("ML recommendations failed: %v. Trying cache.", err)
// Fallback 1: Try cached recommendations
cached, err := s.cacheService.Get(ctx, "recs:"+userID)
if err == nil {
return cached, nil
}
log.Printf("Cache miss: %v. Using default recommendations.", err)
// Fallback 2: Return popular items
return s.defaultRecs, nil
}
Fallback strategies:
- Cached data: Return stale but valid data
- Default values: Return sensible defaults (popular items, empty list, etc.)
- Simplified logic: Use a simpler algorithm that doesn’t depend on the failing service
- Degraded functionality: Disable non-critical features
Rate Limiting
Protect your service from being overwhelmed.
type RateLimiter struct {
requests map[string]*TokenBucket
mu sync.RWMutex
}
type TokenBucket struct {
tokens float64
maxTokens float64
refillRate float64 // tokens per second
lastRefill time.Time
mu sync.Mutex
}
func NewTokenBucket(maxTokens float64, refillRate float64) *TokenBucket {
return &TokenBucket{
tokens: maxTokens,
maxTokens: maxTokens,
refillRate: refillRate,
lastRefill: time.Now(),
}
}
func (tb *TokenBucket) Allow() bool {
tb.mu.Lock()
defer tb.mu.Unlock()
// Refill tokens based on time elapsed
now := time.Now()
elapsed := now.Sub(tb.lastRefill).Seconds()
tb.tokens = math.Min(tb.maxTokens, tb.tokens+elapsed*tb.refillRate)
tb.lastRefill = now
// Check if we have a token available
if tb.tokens >= 1.0 {
tb.tokens -= 1.0
return true
}
return false
}
func NewRateLimiter() *RateLimiter {
	return &RateLimiter{requests: make(map[string]*TokenBucket)}
}

func (rl *RateLimiter) AllowRequest(clientID string) bool {
	rl.mu.RLock()
	bucket, ok := rl.requests[clientID]
	rl.mu.RUnlock()
	if !ok {
		rl.mu.Lock()
		// Re-check under the write lock so concurrent first requests share one bucket
		if bucket, ok = rl.requests[clientID]; !ok {
			bucket = NewTokenBucket(100, 10) // 100 tokens, refill 10/sec
			rl.requests[clientID] = bucket
		}
		rl.mu.Unlock()
	}
	return bucket.Allow()
}
// Middleware
func RateLimitMiddleware(limiter *RateLimiter) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
clientID := getClientID(r) // Could be IP, user ID, API key
if !limiter.AllowRequest(clientID) {
http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
return
}
next.ServeHTTP(w, r)
})
}
}
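Wiring it into a server might look like the following sketch; handleOrders and the route are placeholders, and NewRateLimiter is the small constructor shown above:

// Usage: apply the rate limiter in front of every route
limiter := NewRateLimiter()
mux := http.NewServeMux()
mux.HandleFunc("/orders", handleOrders)
log.Fatal(http.ListenAndServe(":8080", RateLimitMiddleware(limiter)(mux)))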
Combining Patterns
Real resilience comes from combining these patterns:
type ResilientClient struct {
httpClient *http.Client
circuitBreaker *CircuitBreaker
bulkhead *BulkheadPool
retryConfig RetryConfig
}
func (c *ResilientClient) Call(ctx context.Context, url string) ([]byte, error) {
var result []byte
// Bulkhead: Limit concurrent requests
err := c.bulkhead.Execute("external-api", func() error {
// Circuit breaker: Stop calling if service is down
return c.circuitBreaker.Call(func() error {
// Retry with backoff: Handle transient failures
return RetryWithBackoff(ctx, c.retryConfig, func() error {
// Timeout: Don't wait forever
ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
if err != nil {
return err
}
resp, err := c.httpClient.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return fmt.Errorf("status: %d", resp.StatusCode)
}
result, err = io.ReadAll(resp.Body)
return err
})
})
})
return result, err
}
This client has multiple layers of protection:
- Timeout prevents hanging forever
- Retry handles transient failures
- Circuit breaker stops calling a down service
- Bulkhead prevents resource exhaustion
Health Checks and Load Balancing
Remove unhealthy instances from rotation:
type HealthChecker struct {
checkInterval time.Duration
timeout time.Duration
}
func (hc *HealthChecker) CheckHealth(endpoint string) bool {
ctx, cancel := context.WithTimeout(context.Background(), hc.timeout)
defer cancel()
req, err := http.NewRequestWithContext(ctx, "GET", endpoint+"/health", nil)
if err != nil {
return false
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
return false
}
defer resp.Body.Close()
return resp.StatusCode == http.StatusOK
}
type LoadBalancer struct {
backends []string
healthy map[string]bool
mu sync.RWMutex
healthChecker *HealthChecker
current int
}
func (lb *LoadBalancer) StartHealthChecks() {
ticker := time.NewTicker(lb.healthChecker.checkInterval)
go func() {
for range ticker.C {
for _, backend := range lb.backends {
healthy := lb.healthChecker.CheckHealth(backend)
lb.mu.Lock()
lb.healthy[backend] = healthy
lb.mu.Unlock()
if !healthy {
log.Printf("Backend %s is unhealthy", backend)
}
}
}
}()
}
func (lb *LoadBalancer) GetHealthyBackend() (string, error) {
	// Take the write lock because round-robin advances lb.current
	lb.mu.Lock()
	defer lb.mu.Unlock()
	// Round-robin through healthy backends
	for attempts := 0; attempts < len(lb.backends); attempts++ {
		backend := lb.backends[lb.current%len(lb.backends)]
		lb.current++
		if lb.healthy[backend] {
			return backend, nil
		}
	}
	return "", errors.New("no healthy backends available")
}
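A short wiring sketch; the endpoints, intervals, and optimistic initial health values are illustrative:

// Usage
lb := &LoadBalancer{
	backends:      []string{"http://api-1:8080", "http://api-2:8080"},
	healthy:       map[string]bool{"http://api-1:8080": true, "http://api-2:8080": true}, // start optimistic
	healthChecker: &HealthChecker{checkInterval: 10 * time.Second, timeout: 2 * time.Second},
}
lb.StartHealthChecks()

backend, err := lb.GetHealthyBackend()
if err != nil {
	return err // no healthy backends; fail fast or fall back
}
resp, err := http.Get(backend + "/orders") // then handle resp/err as usual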
Monitoring Resilience
Track resilience metrics:
type ResilienceMetrics struct {
	circuitBreakerState *prometheus.GaugeVec // the *Vec types are needed for WithLabelValues
	retryCount          *prometheus.CounterVec
	timeoutCount        *prometheus.CounterVec
	fallbackCount       *prometheus.CounterVec
}
func (m *ResilienceMetrics) RecordCircuitState(service string, state CircuitState) {
m.circuitBreakerState.WithLabelValues(service).Set(float64(state))
}
func (m *ResilienceMetrics) RecordRetry(service string) {
m.retryCount.WithLabelValues(service).Inc()
}
func (m *ResilienceMetrics) RecordTimeout(service string) {
m.timeoutCount.WithLabelValues(service).Inc()
}
func (m *ResilienceMetrics) RecordFallback(service string) {
m.fallbackCount.WithLabelValues(service).Inc()
}
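The vectors have to be created and registered before use. A minimal constructor sketch using client_golang's promauto helpers; the metric names here are illustrative:

func NewResilienceMetrics() *ResilienceMetrics {
	labels := []string{"service"}
	return &ResilienceMetrics{
		circuitBreakerState: promauto.NewGaugeVec(prometheus.GaugeOpts{
			Name: "circuit_breaker_state", Help: "0=closed, 1=open, 2=half-open",
		}, labels),
		retryCount: promauto.NewCounterVec(prometheus.CounterOpts{
			Name: "retries_total", Help: "Retry attempts per service",
		}, labels),
		timeoutCount: promauto.NewCounterVec(prometheus.CounterOpts{
			Name: "timeouts_total", Help: "Timed-out calls per service",
		}, labels),
		fallbackCount: promauto.NewCounterVec(prometheus.CounterOpts{
			Name: "fallbacks_total", Help: "Fallback responses per service",
		}, labels),
	}
}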
Alert on:
- Circuit breakers opening (indicates service issues)
- High retry rates (indicates instability)
- Frequent timeouts (may need timeout tuning)
- Fallback usage (degraded functionality)
Testing Resilience
Chaos engineering: intentionally break things to validate resilience.
// Chaos middleware that randomly fails requests
func ChaosMiddleware(failureRate float64) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if rand.Float64() < failureRate {
// Simulate different failure modes
mode := rand.Intn(3)
switch mode {
case 0:
	// Hang long enough that the caller's timeout should fire first
	time.Sleep(10 * time.Second)
case 1:
// Error
http.Error(w, "Chaos: simulated error", http.StatusInternalServerError)
return
case 2:
// Slow response
time.Sleep(2 * time.Second)
}
}
next.ServeHTTP(w, r)
})
}
}
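You would only enable this outside production, for example behind an environment flag; a sketch (the flag name and mux are illustrative):

// Usage (test/staging only): fail roughly 10% of requests
var handler http.Handler = mux
if os.Getenv("ENABLE_CHAOS") == "true" {
	handler = ChaosMiddleware(0.1)(handler)
}
log.Fatal(http.ListenAndServe(":8080", handler))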
Best Practices
Design for failure: Assume every call can fail and plan accordingly.
Fail fast: Use short timeouts to detect issues quickly.
Degrade gracefully: Provide reduced functionality rather than complete failure.
Monitor everything: Track resilience patterns to understand system behavior.
Test in production: Use canary deployments and gradual rollouts.
Document behavior: Make it clear how your service behaves when dependencies fail.
Looking Forward
Resilience patterns are becoming infrastructure primitives. Service meshes provide circuit breakers and retries automatically. Kubernetes offers health checks and rolling deployments. But understanding the fundamentals remains critical for building robust systems.
The patterns I’ve shared—timeouts, retries, circuit breakers, bulkheads, and fallbacks—are battle-tested. Implement them thoughtfully, tune them based on your system’s characteristics, and monitor their effectiveness.
When your payment API has a bad day, your resilient architecture will keep the rest of your system running.