Building distributed systems is fundamentally different from building monolithic applications. In a monolith, a function call either succeeds or the whole process fails with it. In distributed systems, network calls fail independently. Services go down. Latency spikes. Data gets out of sync.
After a year of building and operating distributed services—key management, encryption, authentication—I’ve learned that success depends on embracing failure as normal and designing for it from the start.
Here are the patterns that have worked for me.
The Fundamental Truth
In a distributed system, failure is not an exception—it’s the norm.
- Networks are unreliable
- Clocks drift
- Services crash
- Latency varies
- Data diverges
Design for these realities, not against them.
Pattern 1: Circuit Breaker
Don’t keep calling a failing service. Fail fast and give it time to recover.
type CircuitBreaker struct {
    maxFailures int
    timeout     time.Duration
    state       State
    failures    int
    lastAttempt time.Time
    mu          sync.Mutex
}

type State int

const (
    Closed   State = iota // Normal operation
    Open                  // Failing, reject requests
    HalfOpen              // Testing if service recovered
)
func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    // Check if circuit is open
    if cb.state == Open {
        if time.Since(cb.lastAttempt) < cb.timeout {
            return errors.New("circuit breaker open")
        }
        // Timeout elapsed - try to recover
        cb.state = HalfOpen
    }

    // Attempt call. (Note: the mutex is held across fn() for simplicity,
    // which serializes calls through this breaker.)
    err := fn()
    if err != nil {
        cb.failures++
        cb.lastAttempt = time.Now()
        if cb.failures >= cb.maxFailures {
            cb.state = Open
        }
        return err
    }

    // Success - reset circuit breaker
    cb.failures = 0
    cb.state = Closed
    return nil
}
Usage:
func callEncryptionService(data []byte) error {
    return circuitBreaker.Call(func() error {
        return encryptionClient.Encrypt(data)
    })
}
This prevents cascading failures. If a downstream service is failing, stop sending it traffic and fail fast.
Pattern 2: Retry with Exponential Backoff
Network blips happen. Retry transient failures, but don’t hammer failing services.
func RetryWithBackoff(fn func() error, maxRetries int) error {
    var err error
    for i := 0; i < maxRetries; i++ {
        err = fn()
        if err == nil {
            return nil
        }

        // Don't retry client errors (4xx)
        if isClientError(err) {
            return err
        }

        // No point backing off after the final attempt
        if i == maxRetries-1 {
            break
        }

        log.Warn("retrying after error",
            "attempt", i+1,
            "max_retries", maxRetries,
            "error", err,
        )

        // Exponential backoff with jitter
        backoff := time.Duration(math.Pow(2, float64(i))) * time.Second
        jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
        time.Sleep(backoff + jitter)
    }
    return fmt.Errorf("max retries exceeded: %w", err)
}

func isClientError(err error) bool {
    // Don't retry 4xx errors - they won't succeed on retry
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode >= 400 && httpErr.StatusCode < 500
    }
    return false
}
Usage:
err := RetryWithBackoff(func() error {
    return saveToDatabase(data)
}, 3)
The exponential backoff prevents thundering herd problems. Jitter prevents synchronized retries from multiple clients.
Pattern 3: Timeout Everything
Never wait forever. Every remote call needs a timeout.
func callWithTimeout(ctx context.Context, fn func() error, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(ctx, timeout)
    defer cancel()

    // Buffered channel so the goroutine can't leak by blocking on send.
    // Note that fn itself keeps running after a timeout; ideally it should
    // accept ctx and stop early when the deadline passes.
    errChan := make(chan error, 1)
    go func() {
        errChan <- fn()
    }()

    select {
    case err := <-errChan:
        return err
    case <-ctx.Done():
        return ctx.Err()
    }
}

// Usage
err := callWithTimeout(context.Background(), func() error {
    return fetchKeyFromDatabase(keyID)
}, 5*time.Second)

if errors.Is(err, context.DeadlineExceeded) {
    log.Error("database call timed out")
}
Set timeouts based on your SLA. If your API promises a 100ms response time, the timeouts on its dependencies have to be well under that, so set them aggressively.
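As a rough sketch of that budgeting idea (the numbers are illustrative, and this assumes a variant of fetchKeyFromDatabase that accepts a context, unlike the snippet above):

func handleKeyRequest(ctx context.Context, keyID string) error {
    // The API promises 100ms overall, so cap the whole request first.
    ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
    defer cancel()

    // Give the database call only part of that budget, leaving headroom
    // for processing and serialization. (Illustrative split.)
    dbCtx, dbCancel := context.WithTimeout(ctx, 60*time.Millisecond)
    defer dbCancel()

    // Hypothetical context-aware fetch; it should respect dbCtx's deadline.
    return fetchKeyFromDatabase(dbCtx, keyID)
}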
Pattern 4: Bulkhead Pattern
Isolate resources so failure in one area doesn’t bring down the whole system.
type BulkheadExecutor struct {
    semaphores map[string]chan struct{}
}

func NewBulkheadExecutor(pools map[string]int) *BulkheadExecutor {
    be := &BulkheadExecutor{
        semaphores: make(map[string]chan struct{}),
    }
    for name, size := range pools {
        be.semaphores[name] = make(chan struct{}, size)
    }
    return be
}

func (be *BulkheadExecutor) Execute(pool string, fn func() error) error {
    sem, ok := be.semaphores[pool]
    if !ok {
        // A nil channel would silently look "full"; fail loudly instead.
        return fmt.Errorf("unknown bulkhead pool: %q", pool)
    }

    // Acquire permit
    select {
    case sem <- struct{}{}:
        defer func() { <-sem }()
        return fn()
    default:
        return errors.New("bulkhead full: too many concurrent operations")
    }
}
Usage:
executor := NewBulkheadExecutor(map[string]int{
    "database": 20, // Max 20 concurrent DB calls
    "hsm":      10, // Max 10 concurrent HSM calls
    "external": 5,  // Max 5 concurrent external API calls
})

// Database calls can't consume all resources
err := executor.Execute("database", func() error {
    return saveToDatabase(data)
})

// HSM calls have separate resource pool
err = executor.Execute("hsm", func() error {
    return hsm.Encrypt(keyID, data)
})
This prevents one slow dependency from consuming all threads/connections.
Pattern 5: Idempotent Operations
In distributed systems, operations may be retried. Design them to be idempotent.
type IdempotentKeyStore struct {
    store map[string]*Key
    mu    sync.RWMutex
}

func (ks *IdempotentKeyStore) CreateKey(keyID string, keyData []byte) error {
    ks.mu.Lock()
    defer ks.mu.Unlock()

    // Check if key already exists
    if existing, exists := ks.store[keyID]; exists {
        // Idempotent: if key matches, succeed
        if bytes.Equal(existing.Data, keyData) {
            return nil
        }
        // Key exists but data differs - conflict
        return errors.New("key already exists with different data")
    }

    // Create key
    ks.store[keyID] = &Key{
        ID:        keyID,
        Data:      keyData,
        CreatedAt: time.Now(),
    }
    return nil
}
With idempotent operations, retries don’t cause duplicate data or inconsistent state.
Pattern 6: Eventual Consistency
Accept that distributed data won’t always be immediately consistent.
type EventuallyConsistentCache struct {
    local        map[string]interface{}
    remote       RemoteStore
    syncInterval time.Duration
    mu           sync.RWMutex
}

func (ecc *EventuallyConsistentCache) Start() {
    // Periodically sync with remote store
    ticker := time.NewTicker(ecc.syncInterval)
    go func() {
        for range ticker.C {
            ecc.sync()
        }
    }()
}

func (ecc *EventuallyConsistentCache) sync() {
    // Fetch latest from remote
    data, err := ecc.remote.GetAll()
    if err != nil {
        log.Error("sync failed", "error", err)
        return
    }

    // Update local cache
    ecc.mu.Lock()
    ecc.local = data
    ecc.mu.Unlock()

    log.Info("cache synced", "items", len(data))
}

func (ecc *EventuallyConsistentCache) Get(key string) (interface{}, error) {
    ecc.mu.RLock()
    defer ecc.mu.RUnlock()

    value, exists := ecc.local[key]
    if !exists {
        return nil, errors.New("key not found")
    }
    return value, nil
}
The cache may be slightly stale, but reads are fast and don’t require remote calls.
Pattern 7: Saga Pattern for Distributed Transactions
Distributed transactions across services are hard. Use sagas instead.
type OrderSaga struct {
    orderService     *OrderService
    paymentService   *PaymentService
    inventoryService *InventoryService
}

func (saga *OrderSaga) PlaceOrder(order *Order) error {
    var paymentID string
    var reservationID string

    // Step 1: Create order
    orderID, err := saga.orderService.CreateOrder(order)
    if err != nil {
        return err
    }

    // Step 2: Reserve inventory
    reservationID, err = saga.inventoryService.Reserve(order.Items)
    if err != nil {
        // Compensate: delete order
        saga.orderService.DeleteOrder(orderID)
        return err
    }

    // Step 3: Process payment
    paymentID, err = saga.paymentService.Charge(order.Amount)
    if err != nil {
        // Compensate: release inventory and delete order
        saga.inventoryService.Release(reservationID)
        saga.orderService.DeleteOrder(orderID)
        return err
    }

    // Step 4: Confirm order
    err = saga.orderService.ConfirmOrder(orderID, paymentID)
    if err != nil {
        // Compensate: refund, release inventory, delete order.
        // (Compensations can fail too; in production they should be retried
        // or driven from a persisted saga log.)
        saga.paymentService.Refund(paymentID)
        saga.inventoryService.Release(reservationID)
        saga.orderService.DeleteOrder(orderID)
        return err
    }

    return nil
}
Each step has a compensating action. If any step fails, previous steps are compensated.
Pattern 8: CQRS (Command Query Responsibility Segregation)
Separate write and read models for better scalability.
// Write model - optimized for updates
type KeyCommandService struct {
    db *Database
}

func (kcs *KeyCommandService) CreateKey(key *Key) error {
    // Validate
    if err := key.Validate(); err != nil {
        return err
    }

    // Write to database
    err := kcs.db.Insert(key)
    if err != nil {
        return err
    }

    // Publish event for read model to consume
    eventBus.Publish(KeyCreatedEvent{
        KeyID:     key.ID,
        CreatedAt: key.CreatedAt,
        Metadata:  key.Metadata,
    })
    return nil
}

// Read model - optimized for queries
type KeyQueryService struct {
    cache *Cache
}

func (kqs *KeyQueryService) GetKey(keyID string) (*Key, error) {
    // Read from cache (fast)
    return kqs.cache.Get(keyID)
}

func (kqs *KeyQueryService) HandleKeyCreatedEvent(event KeyCreatedEvent) {
    // Update read model
    kqs.cache.Set(event.KeyID, &Key{
        ID:        event.KeyID,
        CreatedAt: event.CreatedAt,
        Metadata:  event.Metadata,
    })
}
The write model handles updates; the read model handles queries. Each can use a data store optimized for its access pattern.
Pattern 9: Leader Election
In a distributed system, sometimes you need a single coordinator.
type LeaderElection struct {
    nodeID         string
    lease          *Lease
    isLeader       bool
    onBecomeLeader func()
    onLoseLeader   func()
}

func (le *LeaderElection) Campaign() {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        // Try to acquire (or renew) the lease. The lease TTL (10s) is longer
        // than the ticker interval (5s), so a healthy leader renews before expiry.
        acquired, err := le.lease.TryAcquire(le.nodeID, 10*time.Second)
        if err != nil {
            log.Error("lease acquisition failed", "error", err)
            continue
        }

        if acquired && !le.isLeader {
            // Became leader
            le.isLeader = true
            log.Info("became leader", "node_id", le.nodeID)
            if le.onBecomeLeader != nil {
                go le.onBecomeLeader()
            }
        } else if !acquired && le.isLeader {
            // Lost leadership
            le.isLeader = false
            log.Info("lost leadership", "node_id", le.nodeID)
            if le.onLoseLeader != nil {
                go le.onLoseLeader()
            }
        }
    }
}

// Usage
election := &LeaderElection{
    nodeID: "node-1",
    lease:  distributedLease,
    onBecomeLeader: func() {
        // Start leader-only tasks (background jobs, etc.)
        startCronJobs()
    },
    onLoseLeader: func() {
        // Stop leader tasks
        stopCronJobs()
    },
}
go election.Campaign()
Use this for tasks that should run on exactly one node (cron jobs, cleanup tasks).
Pattern 10: Request Hedging
Send duplicate requests to reduce tail latency.
func HedgedRequest(fn func() (interface{}, error), delay time.Duration) (interface{}, error) {
    // Buffered channels so neither request blocks on send if the other wins.
    resultChan := make(chan interface{}, 2)
    errorChan := make(chan error, 2)

    // Start first request
    go func() {
        result, err := fn()
        if err != nil {
            errorChan <- err
            return
        }
        resultChan <- result
    }()

    // Start backup request after delay
    time.AfterFunc(delay, func() {
        result, err := fn()
        if err != nil {
            errorChan <- err
            return
        }
        resultChan <- result
    })

    // Return first successful result
    select {
    case result := <-resultChan:
        return result, nil
    case err := <-errorChan:
        // One request failed - wait for the other
        select {
        case result := <-resultChan:
            return result, nil
        case err2 := <-errorChan:
            return nil, fmt.Errorf("both requests failed: %w, %w", err, err2)
        }
    }
}
If the first request is slow, the backup request may finish first. This reduces P99 latency at the cost of increased load.
Observability Patterns
Distributed Tracing
Track requests across services:
func (s *Service) HandleRequest(ctx context.Context, req *Request) (*Response, error) {
    // Start span
    span, ctx := opentracing.StartSpanFromContext(ctx, "handle_request")
    defer span.Finish()

    // Add metadata
    span.SetTag("request_id", req.ID)
    span.SetTag("user_id", req.UserID)

    // Call downstream service (span context propagates)
    key, err := s.keyService.GetKey(ctx, req.KeyID)
    if err != nil {
        span.SetTag("error", true)
        span.LogFields(
            log.String("event", "error"),
            log.String("message", err.Error()),
        )
        return nil, err
    }

    // Process request
    // (Tag the response size here as well if your Response type exposes one.)
    resp := s.process(ctx, req, key)
    return resp, nil
}
Structured Logging with Context
func (s *Service) ProcessOrder(ctx context.Context, order *Order) error {
    logger := log.WithContext(ctx).WithFields(log.Fields{
        "order_id": order.ID,
        "user_id":  order.UserID,
        "trace_id": getTraceID(ctx),
    })

    logger.Info("processing order")

    // Log at each step
    logger.Debug("validating order")
    if err := s.validate(order); err != nil {
        logger.Error("validation failed", "error", err)
        return err
    }

    logger.Debug("saving order")
    if err := s.save(order); err != nil {
        logger.Error("save failed", "error", err)
        return err
    }

    logger.Info("order processed successfully")
    return nil
}
Anti-Patterns to Avoid
Distributed Monolith
Don’t make every operation synchronously call multiple services. This creates a distributed monolith with all the complexity of microservices and none of the benefits.
Chatty Services
Minimize inter-service calls. Batch when possible. Use caching.
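For example, fetching keys one at a time in a loop turns a single logical operation into N network round trips. A rough sketch of the difference (keyClient and its GetKeys batch method are hypothetical, assuming the service exposes a batch endpoint):

// Chatty version - one round trip per key.
func fetchKeysChatty(ctx context.Context, keyIDs []string) ([]*Key, error) {
    keys := make([]*Key, 0, len(keyIDs))
    for _, id := range keyIDs {
        key, err := keyClient.GetKey(ctx, id) // N network calls
        if err != nil {
            return nil, err
        }
        keys = append(keys, key)
    }
    return keys, nil
}

// Batched version - one round trip for the whole set.
func fetchKeysBatched(ctx context.Context, keyIDs []string) ([]*Key, error) {
    return keyClient.GetKeys(ctx, keyIDs)
}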
Ignoring Partial Failures
A service can be partially available (some instances healthy, some not). Design for this.
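One way to handle this on the client side is to track per-instance health and route around the unhealthy ones. A simplified sketch (the Instance and InstancePool types are illustrative; health could be driven by health checks or by a per-instance circuit breaker):

type Instance struct {
    Addr    string
    Healthy bool // updated by health checks or recent call outcomes
}

type InstancePool struct {
    mu        sync.Mutex
    instances []*Instance
    next      int
}

// Pick returns the next healthy instance, skipping ones marked unhealthy.
func (p *InstancePool) Pick() (*Instance, error) {
    p.mu.Lock()
    defer p.mu.Unlock()

    for i := 0; i < len(p.instances); i++ {
        inst := p.instances[(p.next+i)%len(p.instances)]
        if inst.Healthy {
            p.next = (p.next + i + 1) % len(p.instances)
            return inst, nil
        }
    }
    return nil, errors.New("no healthy instances available")
}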
Synchronous Cascades
Service A calls B calls C calls D… If any service is slow, the whole chain is slow. Use async messaging where possible.
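For instance, rather than recording an audit entry inline and inheriting the audit service's latency, persist the state change and publish an event for consumers to pick up later. A sketch (keyStore, auditService, queue, and the event type are all hypothetical):

// Hypothetical event type.
type KeyRotatedEvent struct {
    KeyID string
}

// Synchronous cascade: this request is only as fast as the audit service.
func rotateKeySync(ctx context.Context, keyID string) error {
    if err := keyStore.Rotate(ctx, keyID); err != nil {
        return err
    }
    return auditService.Record(ctx, "key_rotated", keyID) // blocks on a remote call
}

// Async: persist the state change, then publish an event for interested services.
func rotateKeyAsync(ctx context.Context, keyID string) error {
    if err := keyStore.Rotate(ctx, keyID); err != nil {
        return err
    }
    return queue.Publish(ctx, "key.rotated", KeyRotatedEvent{KeyID: keyID})
}

The trade-off is that downstream effects become eventually consistent, which is usually acceptable for things like audit logs and notifications.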
Lessons Learned
- Embrace Async: Synchronous calls create tight coupling and cascading failures. Use message queues where possible.
- Design for Failure: Every remote call will fail eventually. Handle it gracefully.
- Observability is Essential: You can’t debug distributed systems without good logging, metrics, and tracing.
- Start Simple: Don’t build Netflix-scale infrastructure from day one. Start with simple patterns and add complexity as needed.
- Test Failure Scenarios: Practice chaos engineering: randomly kill services in test environments to verify resilience.
Conclusion
Distributed systems are hard. These patterns help, but they’re not silver bullets. Each pattern has tradeoffs.
The key is understanding the failure modes of distributed systems and designing for them:
- Network partitions
- Service failures
- Latency spikes
- Data inconsistency
- Partial failures
Use these patterns as tools in your toolbox. Apply them where they make sense. Don’t over-engineer.
Most importantly: test your system under failure conditions. Kill services randomly. Introduce latency. Simulate network partitions. See what breaks and fix it.
Building reliable distributed systems is an iterative process. Start simple, measure, learn, improve.
In future posts, I’ll dive deeper into specific topics: consistency models, consensus algorithms, and chaos engineering practices.
The journey to mastering distributed systems is long, but it’s worth it. These are the systems of the future.
Stay resilient!