Building distributed systems is fundamentally different from building monolithic applications. In a monolith, a function call either succeeds or the whole process fails with it. In distributed systems, network calls fail independently. Services go down. Latency spikes. Data gets out of sync.
After a year of building and operating distributed services—key management, encryption, authentication—I’ve learned that success depends on embracing failure as normal and designing for it from the start.
Here are the patterns that have worked for me.
The Fundamental Truth
In a distributed system, failure is not an exception—it’s the norm.
- Networks are unreliable
- Clocks drift
- Services crash
- Latency varies
- Data diverges
Design for these realities, not against them.
Pattern 1: Circuit Breaker
Don’t keep calling a failing service. Fail fast and give it time to recover.
type CircuitBreaker struct {
    maxFailures int
    timeout     time.Duration
    state       State
    failures    int
    lastAttempt time.Time
    mu          sync.Mutex
}

type State int

const (
    Closed   State = iota // Normal operation
    Open                  // Failing, reject requests
    HalfOpen              // Testing if service recovered
)
func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    // Check if circuit is open
    if cb.state == Open {
        if time.Since(cb.lastAttempt) < cb.timeout {
            return errors.New("circuit breaker open")
        }
        // Timeout elapsed - try to recover
        cb.state = HalfOpen
    }

    // Attempt call. (Note: the mutex is held across fn() for simplicity,
    // which serializes calls through this breaker.)
    err := fn()
    if err != nil {
        cb.failures++
        cb.lastAttempt = time.Now()
        if cb.failures >= cb.maxFailures {
            cb.state = Open
        }
        return err
    }

    // Success - reset circuit breaker
    cb.failures = 0
    cb.state = Closed
    return nil
}
Usage:
func callEncryptionService(data []byte) error {
    return circuitBreaker.Call(func() error {
        return encryptionClient.Encrypt(data)
    })
}
This prevents cascading failures. If a downstream service is failing, stop sending it traffic and fail fast.
Pattern 2: Retry with Exponential Backoff
Network blips happen. Retry transient failures, but don’t hammer failing services.
func RetryWithBackoff(fn func() error, maxRetries int) error {
    var err error
    for i := 0; i < maxRetries; i++ {
        err = fn()
        if err == nil {
            return nil
        }

        // Don't retry client errors (4xx)
        if isClientError(err) {
            return err
        }

        // No point backing off after the final attempt
        if i == maxRetries-1 {
            break
        }

        log.Warn("retrying after error",
            "attempt", i+1,
            "max_retries", maxRetries,
            "error", err,
        )

        // Exponential backoff with jitter
        backoff := time.Duration(math.Pow(2, float64(i))) * time.Second
        jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
        time.Sleep(backoff + jitter)
    }
    return fmt.Errorf("max retries exceeded: %w", err)
}

func isClientError(err error) bool {
    // Don't retry 4xx errors - they won't succeed on retry
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode >= 400 && httpErr.StatusCode < 500
    }
    return false
}
Usage:
err := RetryWithBackoff(func() error {
    return saveToDatabase(data)
}, 3)
The exponential backoff prevents thundering herd problems. Jitter prevents synchronized retries from multiple clients.
Pattern 3: Timeout Everything
Never wait forever. Every remote call needs a timeout.
func callWithTimeout(ctx context.Context, fn func() error, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(ctx, timeout)
    defer cancel()

    // Buffered channel so the goroutine can't leak by blocking on send.
    // Note that fn itself keeps running after a timeout; ideally it should
    // accept ctx and stop early when the deadline passes.
    errChan := make(chan error, 1)
    go func() {
        errChan <- fn()
    }()

    select {
    case err := <-errChan:
        return err
    case <-ctx.Done():
        return ctx.Err()
    }
}

// Usage
err := callWithTimeout(context.Background(), func() error {
    return fetchKeyFromDatabase(keyID)
}, 5*time.Second)

if errors.Is(err, context.DeadlineExceeded) {
    log.Error("database call timed out")
}
Set timeouts based on your SLA. If your API promises a 100ms response time, the timeouts on its dependencies have to be well under that, so set them aggressively.
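As a rough sketch of that budgeting idea (the numbers are illustrative, and this assumes a variant of fetchKeyFromDatabase that accepts a context, unlike the snippet above):

func handleKeyRequest(ctx context.Context, keyID string) error {
    // The API promises 100ms overall, so cap the whole request first.
    ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
    defer cancel()

    // Give the database call only part of that budget, leaving headroom
    // for processing and serialization. (Illustrative split.)
    dbCtx, dbCancel := context.WithTimeout(ctx, 60*time.Millisecond)
    defer dbCancel()

    // Hypothetical context-aware fetch; it should respect dbCtx's deadline.
    return fetchKeyFromDatabase(dbCtx, keyID)
}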
Pattern 4: Bulkhead Pattern
Isolate resources so failure in one area doesn’t bring down the whole system.
type BulkheadExecutor struct {
    semaphores map[string]chan struct{}
}

func NewBulkheadExecutor(pools map[string]int) *BulkheadExecutor {
    be := &BulkheadExecutor{
        semaphores: make(map[string]chan struct{}),
    }
    for name, size := range pools {
        be.semaphores[name] = make(chan struct{}, size)
    }
    return be
}

func (be *BulkheadExecutor) Execute(pool string, fn func() error) error {
    sem, ok := be.semaphores[pool]
    if !ok {
        // A nil channel would silently look "full"; fail loudly instead.
        return fmt.Errorf("unknown bulkhead pool: %q", pool)
    }

    // Acquire permit
    select {
    case sem <- struct{}{}:
        defer func() { <-sem }()
        return fn()
    default:
        return errors.New("bulkhead full: too many concurrent operations")
    }
}
Usage:
executor := NewBulkheadExecutor(map[string]int{
    "database": 20, // Max 20 concurrent DB calls
    "hsm":      10, // Max 10 concurrent HSM calls
    "external": 5,  // Max 5 concurrent external API calls
})

// Database calls can't consume all resources
err := executor.Execute("database", func() error {
    return saveToDatabase(data)
})

// HSM calls have separate resource pool
err = executor.Execute("hsm", func() error {
    return hsm.Encrypt(keyID, data)
})
This prevents one slow dependency from consuming all threads/connections.
Pattern 5: Idempotent Operations
In distributed systems, operations may be retried. Design them to be idempotent.
type IdempotentKeyStore struct {
    store map[string]*Key
    mu    sync.RWMutex
}

func (ks *IdempotentKeyStore) CreateKey(keyID string, keyData []byte) error {
    ks.mu.Lock()
    defer ks.mu.Unlock()

    // Check if key already exists
    if existing, exists := ks.store[keyID]; exists {
        // Idempotent: if key matches, succeed
        if bytes.Equal(existing.Data, keyData) {
            return nil
        }
        // Key exists but data differs - conflict
        return errors.New("key already exists with different data")
    }

    // Create key
    ks.store[keyID] = &Key{
        ID:        keyID,
        Data:      keyData,
        CreatedAt: time.Now(),
    }
    return nil
}
With idempotent operations, retries don’t cause duplicate data or inconsistent state.
Pattern 6: Eventual Consistency
Accept that distributed data won’t always be immediately consistent.
type EventuallyConsistentCache struct {
    local        map[string]interface{}
    remote       RemoteStore
    syncInterval time.Duration
    mu           sync.RWMutex
}

func (ecc *EventuallyConsistentCache) Start() {
    // Periodically sync with remote store
    ticker := time.NewTicker(ecc.syncInterval)
    go func() {
        for range ticker.C {
            ecc.sync()
        }
    }()
}

func (ecc *EventuallyConsistentCache) sync() {
    // Fetch latest from remote
    data, err := ecc.remote.GetAll()
    if err != nil {
        log.Error("sync failed", "error", err)
        return
    }

    // Update local cache
    ecc.mu.Lock()
    ecc.local = data
    ecc.mu.Unlock()

    log.Info("cache synced", "items", len(data))
}

func (ecc *EventuallyConsistentCache) Get(key string) (interface{}, error) {
    ecc.mu.RLock()
    defer ecc.mu.RUnlock()

    value, exists := ecc.local[key]
    if !exists {
        return nil, errors.New("key not found")
    }
    return value, nil
}
The cache may be slightly stale, but reads are fast and don’t require remote calls.
Pattern 7: Saga Pattern for Distributed Transactions
Distributed transactions across services are hard. Use sagas instead.
type OrderSaga struct {
    orderService     *OrderService
    paymentService   *PaymentService
    inventoryService *InventoryService
}

func (saga *OrderSaga) PlaceOrder(order *Order) error {
    var paymentID string
    var reservationID string

    // Step 1: Create order
    orderID, err := saga.orderService.CreateOrder(order)
    if err != nil {
        return err
    }

    // Step 2: Reserve inventory
    reservationID, err = saga.inventoryService.Reserve(order.Items)
    if err != nil {
        // Compensate: delete order
        saga.orderService.DeleteOrder(orderID)
        return err
    }

    // Step 3: Process payment
    paymentID, err = saga.paymentService.Charge(order.Amount)
    if err != nil {
        // Compensate: release inventory and delete order
        saga.inventoryService.Release(reservationID)
        saga.orderService.DeleteOrder(orderID)
        return err
    }

    // Step 4: Confirm order
    err = saga.orderService.ConfirmOrder(orderID, paymentID)
    if err != nil {
        // Compensate: refund, release inventory, delete order.
        // (Compensations can fail too; in production they should be retried
        // or driven from a persisted saga log.)
        saga.paymentService.Refund(paymentID)
        saga.inventoryService.Release(reservationID)
        saga.orderService.DeleteOrder(orderID)
        return err
    }

    return nil
}
Each step has a compensating action. If any step fails, previous steps are compensated.
Pattern 8: CQRS (Command Query Responsibility Segregation)
Separate write and read models for better scalability.
// Write model - optimized for updates
type KeyCommandService struct {
    db *Database
}

func (kcs *KeyCommandService) CreateKey(key *Key) error {
    // Validate
    if err := key.Validate(); err != nil {
        return err
    }

    // Write to database
    err := kcs.db.Insert(key)
    if err != nil {
        return err
    }

    // Publish event for read model to consume
    eventBus.Publish(KeyCreatedEvent{
        KeyID:     key.ID,
        CreatedAt: key.CreatedAt,
        Metadata:  key.Metadata,
    })
    return nil
}

// Read model - optimized for queries
type KeyQueryService struct {
    cache *Cache
}

func (kqs *KeyQueryService) GetKey(keyID string) (*Key, error) {
    // Read from cache (fast)
    return kqs.cache.Get(keyID)
}

func (kqs *KeyQueryService) HandleKeyCreatedEvent(event KeyCreatedEvent) {
    // Update read model
    kqs.cache.Set(event.KeyID, &Key{
        ID:        event.KeyID,
        CreatedAt: event.CreatedAt,
        Metadata:  event.Metadata,
    })
}
The write model handles updates; the read model handles queries. Each can use a data store optimized for its access pattern.
Pattern 9: Leader Election
In a distributed system, sometimes you need a single coordinator.
type LeaderElection struct {
    nodeID         string
    lease          *Lease
    isLeader       bool
    onBecomeLeader func()
    onLoseLeader   func()
}

func (le *LeaderElection) Campaign() {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        // Try to acquire (or renew) the lease. The lease TTL (10s) is longer
        // than the ticker interval (5s), so a healthy leader renews before expiry.
        acquired, err := le.lease.TryAcquire(le.nodeID, 10*time.Second)
        if err != nil {
            log.Error("lease acquisition failed", "error", err)
            continue
        }

        if acquired && !le.isLeader {
            // Became leader
            le.isLeader = true
            log.Info("became leader", "node_id", le.nodeID)
            if le.onBecomeLeader != nil {
                go le.onBecomeLeader()
            }
        } else if !acquired && le.isLeader {
            // Lost leadership
            le.isLeader = false
            log.Info("lost leadership", "node_id", le.nodeID)
            if le.onLoseLeader != nil {
                go le.onLoseLeader()
            }
        }
    }
}

// Usage
election := &LeaderElection{
    nodeID: "node-1",
    lease:  distributedLease,
    onBecomeLeader: func() {
        // Start leader-only tasks (background jobs, etc.)
        startCronJobs()
    },
    onLoseLeader: func() {
        // Stop leader tasks
        stopCronJobs()
    },
}
go election.Campaign()
Use this for tasks that should run on exactly one node (cron jobs, cleanup tasks).
Pattern 10: Request Hedging
Send duplicate requests to reduce tail latency.
func HedgedRequest(fn func() (interface{}, error), delay time.Duration) (interface{}, error) {
    // Buffered channels so neither request blocks on send if the other wins.
    resultChan := make(chan interface{}, 2)
    errorChan := make(chan error, 2)

    // Start first request
    go func() {
        result, err := fn()
        if err != nil {
            errorChan <- err
            return
        }
        resultChan <- result
    }()

    // Start backup request after delay
    time.AfterFunc(delay, func() {
        result, err := fn()
        if err != nil {
            errorChan <- err
            return
        }
        resultChan <- result
    })

    // Return first successful result
    select {
    case result := <-resultChan:
        return result, nil
    case err := <-errorChan:
        // One request failed - wait for the other
        select {
        case result := <-resultChan:
            return result, nil
        case err2 := <-errorChan:
            return nil, fmt.Errorf("both requests failed: %w, %w", err, err2)
        }
    }
}
If the first request is slow, the backup request may finish first. This reduces P99 latency at the cost of increased load.
Observability Patterns
Distributed Tracing
Track requests across services:
func (s *Service) HandleRequest(ctx context.Context, req *Request) (*Response, error) {
    // Start span
    span, ctx := opentracing.StartSpanFromContext(ctx, "handle_request")
    defer span.Finish()

    // Add metadata
    span.SetTag("request_id", req.ID)
    span.SetTag("user_id", req.UserID)

    // Call downstream service (span context propagates)
    key, err := s.keyService.GetKey(ctx, req.KeyID)
    if err != nil {
        span.SetTag("error", true)
        span.LogFields(
            log.String("event", "error"),
            log.String("message", err.Error()),
        )
        return nil, err
    }

    // Process request
    // (Tag the response size here as well if your Response type exposes one.)
    resp := s.process(ctx, req, key)
    return resp, nil
}
Structured Logging with Context
func (s *Service) ProcessOrder(ctx context.Context, order *Order) error {
    logger := log.WithContext(ctx).WithFields(log.Fields{
        "order_id": order.ID,
        "user_id":  order.UserID,
        "trace_id": getTraceID(ctx),
    })

    logger.Info("processing order")

    // Log at each step
    logger.Debug("validating order")
    if err := s.validate(order); err != nil {
        logger.Error("validation failed", "error", err)
        return err
    }

    logger.Debug("saving order")
    if err := s.save(order); err != nil {
        logger.Error("save failed", "error", err)
        return err
    }

    logger.Info("order processed successfully")
    return nil
}
Anti-Patterns to Avoid
Distributed Monolith
Don’t make every operation synchronously call multiple services. This creates a distributed monolith with all the complexity of microservices and none of the benefits.
Chatty Services
Minimize inter-service calls. Batch when possible. Use caching.
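For example, fetching keys one at a time in a loop turns a single logical operation into N network round trips. A rough sketch of the difference (keyClient and its GetKeys batch method are hypothetical, assuming the service exposes a batch endpoint):

// Chatty version - one round trip per key.
func fetchKeysChatty(ctx context.Context, keyIDs []string) ([]*Key, error) {
    keys := make([]*Key, 0, len(keyIDs))
    for _, id := range keyIDs {
        key, err := keyClient.GetKey(ctx, id) // N network calls
        if err != nil {
            return nil, err
        }
        keys = append(keys, key)
    }
    return keys, nil
}

// Batched version - one round trip for the whole set.
func fetchKeysBatched(ctx context.Context, keyIDs []string) ([]*Key, error) {
    return keyClient.GetKeys(ctx, keyIDs)
}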
Ignoring Partial Failures
A service can be partially available (some instances healthy, some not). Design for this.
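One way to handle this on the client side is to track per-instance health and route around the unhealthy ones. A simplified sketch (the Instance and InstancePool types are illustrative; health could be driven by health checks or by a per-instance circuit breaker):

type Instance struct {
    Addr    string
    Healthy bool // updated by health checks or recent call outcomes
}

type InstancePool struct {
    mu        sync.Mutex
    instances []*Instance
    next      int
}

// Pick returns the next healthy instance, skipping ones marked unhealthy.
func (p *InstancePool) Pick() (*Instance, error) {
    p.mu.Lock()
    defer p.mu.Unlock()

    for i := 0; i < len(p.instances); i++ {
        inst := p.instances[(p.next+i)%len(p.instances)]
        if inst.Healthy {
            p.next = (p.next + i + 1) % len(p.instances)
            return inst, nil
        }
    }
    return nil, errors.New("no healthy instances available")
}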
Synchronous Cascades
Service A calls B calls C calls D… If any service is slow, the whole chain is slow. Use async messaging where possible.
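For instance, rather than recording an audit entry inline and inheriting the audit service's latency, persist the state change and publish an event for consumers to pick up later. A sketch (keyStore, auditService, queue, and the event type are all hypothetical):

// Hypothetical event type.
type KeyRotatedEvent struct {
    KeyID string
}

// Synchronous cascade: this request is only as fast as the audit service.
func rotateKeySync(ctx context.Context, keyID string) error {
    if err := keyStore.Rotate(ctx, keyID); err != nil {
        return err
    }
    return auditService.Record(ctx, "key_rotated", keyID) // blocks on a remote call
}

// Async: persist the state change, then publish an event for interested services.
func rotateKeyAsync(ctx context.Context, keyID string) error {
    if err := keyStore.Rotate(ctx, keyID); err != nil {
        return err
    }
    return queue.Publish(ctx, "key.rotated", KeyRotatedEvent{KeyID: keyID})
}

The trade-off is that downstream effects become eventually consistent, which is usually acceptable for things like audit logs and notifications.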
Lessons Learned
- Embrace Async: Synchronous calls create tight coupling and cascading failures. Use message queues where possible.
- Design for Failure: Every remote call will fail eventually. Handle it gracefully.
- Observability is Essential: You can’t debug distributed systems without good logging, metrics, and tracing.
- Start Simple: Don’t build Netflix-scale infrastructure from day one. Start with simple patterns and add complexity as needed.
- Test Failure Scenarios: Practice chaos engineering: randomly kill services in test environments to verify resilience.
Conclusion
Distributed systems are hard. These patterns help, but they’re not silver bullets. Each pattern has tradeoffs.
The key is understanding the failure modes of distributed systems and designing for them:
- Network partitions
- Service failures
- Latency spikes
- Data inconsistency
- Partial failures
Use these patterns as tools in your toolbox. Apply them where they make sense. Don’t over-engineer.
Most importantly: test your system under failure conditions. Kill services randomly. Introduce latency. Simulate network partitions. See what breaks and fix it.
Building reliable distributed systems is an iterative process. Start simple, measure, learn, improve.
In future posts, I’ll dive deeper into specific topics: consistency models, consensus algorithms, and chaos engineering practices.
The journey to mastering distributed systems is long, but it’s worth it. These are the systems of the future.
Stay resilient!