Our microservices architecture grew from 5 services to 40 services in a year. Deployments are faster, teams are autonomous, and we ship features more quickly.
But when something breaks, troubleshooting is a nightmare.
A user reports a slow request. Where do you start? The request touched 12 services, 3 databases, 2 message queues, and 5 third-party APIs. Traditional logging and monitoring don’t cut it anymore.
This is the observability problem in distributed systems.
Over the past six months, we’ve built observability into our architecture. Not just monitoring what we think might fail, but instrumenting systems to answer questions we haven’t thought of yet.
Here’s what observability actually means and how to build it.
Monitoring vs. Observability
First, let’s clarify terms. They’re not the same thing.
Monitoring is watching known failure modes:
- Is the CPU usage high?
- Is the disk full?
- Is the service responding to health checks?
You define metrics upfront. You set alerts for thresholds. This works great for known failure scenarios.
Observability is understanding system behavior by examining outputs:
- Why is this particular user’s request slow?
- Which service is causing latency spikes?
- How does this request flow through the system?
You don’t have to predict every failure mode up front. The system emits enough telemetry to investigate issues you never anticipated.
The difference:
Monitoring: "The API has high latency"
Observability: "The API is slow because the database query for user 12345's
orders is timing out due to a missing index on the orders table"
Monitoring tells you something is wrong. Observability tells you why.
The Three Pillars of Observability
Observability rests on three pillars: metrics, logs, and traces.
Pillar 1: Metrics
Metrics are numerical measurements over time:
- Request count
- Response time
- Error rate
- CPU usage
- Memory consumption
Metrics are cheap to collect and store. They’re perfect for dashboards and alerting.
// Instrumentation with Prometheus
import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency",
			Buckets: []float64{.001, .005, .01, .05, .1, .5, 1, 5},
		},
		[]string{"method", "endpoint", "status"},
	)
	requestCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
)

func init() {
	// Register the collectors so they are exported on /metrics
	prometheus.MustRegister(requestDuration, requestCount)
}

// responseWriter wraps http.ResponseWriter to capture the status code
type responseWriter struct {
	http.ResponseWriter
	status int
}

func (w *responseWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Wrap the response writer to capture the status code
		wrapper := &responseWriter{ResponseWriter: w, status: 200}
		handler(wrapper, r)
		duration := time.Since(start).Seconds()
		status := strconv.Itoa(wrapper.status)
		requestDuration.WithLabelValues(r.Method, r.URL.Path, status).Observe(duration)
		requestCount.WithLabelValues(r.Method, r.URL.Path, status).Inc()
	}
}
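To make these metrics scrapeable, the instrumented handler and the metrics endpoint need to be wired into the server. A minimal sketch, assuming the standard promhttp handler from the official client library (ordersHandler is an illustrative name, not part of the code above):

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Business endpoint, wrapped with the instrumentation above
	http.HandleFunc("/api/orders", instrumentHandler(ordersHandler))

	// Prometheus scrapes metrics from this endpoint
	http.Handle("/metrics", promhttp.Handler())

	http.ListenAndServe(":8080", nil)
}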
Key metrics for every service:
- RED metrics: Rate, Errors, Duration
- USE metrics: Utilization, Saturation, Errors
Metrics answer: “What is happening?”
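As a concrete illustration, the RED metrics fall straight out of the two Prometheus series defined above (these queries assume those metric names):

# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration: 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))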
Pillar 2: Logs
Logs are discrete events:
2016-09-22T10:15:30.456Z INFO Request started method=GET path=/api/users/123
2016-09-22T10:15:30.512Z INFO Database query executed query="SELECT * FROM users WHERE id=?" duration=45ms
2016-09-22T10:15:30.558Z ERROR Failed to fetch user user_id=123 error="connection timeout"
Use structured logging:
import "github.com/sirupsen/logrus"
log := logrus.WithFields(logrus.Fields{
"user_id": userID,
"request_id": requestID,
"service": "api",
})
log.Info("Processing user request")
// Later in the code
log.WithFields(logrus.Fields{
"query": query,
"duration": duration,
}).Info("Database query executed")
Structured logs are machine-readable. You can query them:
# Find all slow database queries
query="SELECT * FROM users WHERE id=?" AND duration > 100ms
# Find all errors for a specific user
user_id=123 AND level=ERROR
Logs answer: “What happened and what was the context?”
Pillar 3: Traces
Traces show request flow through distributed systems:
Request ID: abc123
[Frontend] GET /orders → 45ms
↓
[API Gateway] GET /api/orders → 820ms
↓
├─ [Order Service] GET /orders/user/123 → 780ms
│ ↓
│ ├─ [Database] SELECT orders → 15ms
│ ├─ [Product Service] GET /products/456 → 250ms
│ │ └─ [Cache] GET product:456 → 2ms
│ ├─ [Product Service] GET /products/789 → 500ms ⚠️ SLOW
│ │ └─ [Database] SELECT products → 485ms ⚠️ SLOW
│ └─ [User Service] GET /users/123 → 10ms
└─ [Analytics] track_event → 5ms (async)
Traces show exactly where time is spent. In this example, we see that /products/789 is slow because of a database query.
Implementing distributed tracing:
import (
	"log"

	"github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// Initialize tracer (the Configuration struct lives in the
// jaeger-client-go "config" subpackage)
func initTracer() opentracing.Tracer {
	cfg := jaegercfg.Configuration{
		ServiceName: "order-service",
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "const",
			Param: 1, // Sample all requests
		},
		Reporter: &jaegercfg.ReporterConfig{
			LogSpans:           true,
			LocalAgentHostPort: "jaeger:6831",
		},
	}
	tracer, _, err := cfg.NewTracer() // the returned closer should be flushed on shutdown
	if err != nil {
		log.Fatalf("failed to initialize tracer: %v", err)
	}
	return tracer
}
// Trace HTTP handlers
func traceHandler(handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
// Extract span context from headers
spanCtx, _ := opentracing.GlobalTracer().Extract(
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(r.Header),
)
// Start new span
span := opentracing.StartSpan(
r.URL.Path,
opentracing.ChildOf(spanCtx),
opentracing.Tag{Key: "http.method", Value: r.Method},
opentracing.Tag{Key: "http.url", Value: r.URL.String()},
)
defer span.Finish()
// Add span to context
ctx := opentracing.ContextWithSpan(r.Context(), span)
handler(w, r.WithContext(ctx))
}
}
// Trace database calls
func queryWithTracing(ctx context.Context, query string) (*sql.Rows, error) {
span, _ := opentracing.StartSpanFromContext(ctx, "database query")
span.SetTag("query", query)
defer span.Finish()
return db.QueryContext(ctx, query)
}
// Trace outbound HTTP calls
func httpCallWithTracing(ctx context.Context, url string) (*http.Response, error) {
span, ctx := opentracing.StartSpanFromContext(ctx, "http call")
span.SetTag("url", url)
defer span.Finish()
req, _ := http.NewRequest("GET", url, nil)
req = req.WithContext(ctx)
// Inject span context into headers
opentracing.GlobalTracer().Inject(
span.Context(),
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(req.Header),
)
return http.DefaultClient.Do(req)
}
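Tying these helpers together at service startup — a minimal sketch (ordersHandler is a placeholder for whatever handler the service exposes):

func main() {
	// Register the tracer globally so traceHandler and the
	// StartSpanFromContext helpers above can find it
	tracer := initTracer()
	opentracing.SetGlobalTracer(tracer)

	http.HandleFunc("/orders", traceHandler(ordersHandler))
	http.ListenAndServe(":8080", nil)
}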
Traces answer: “How did this request flow through the system?”
Correlation: Tying It All Together
The real power comes from correlating metrics, logs, and traces.
Request IDs
Every request gets a unique ID that flows through all services:
func requestIDMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requestID := r.Header.Get("X-Request-ID")
if requestID == "" {
requestID = uuid.New().String()
}
// Add to response headers
w.Header().Set("X-Request-ID", requestID)
// Add to context
ctx := context.WithValue(r.Context(), "request_id", requestID)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
Propagate request ID to all downstream services:
func callDownstreamService(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	// Propagate request ID (comma-ok avoids a panic if no ID is on the context)
	if requestID, ok := ctx.Value("request_id").(string); ok {
		req.Header.Set("X-Request-ID", requestID)
	}
	return http.DefaultClient.Do(req)
}
Include request ID in all logs:
log := logrus.WithField("request_id", requestID)
log.Info("Processing request")
Now you can trace a request across all services:
# Find all logs for request abc123
request_id=abc123
# Find trace for request
trace_id=abc123
Unified Observability
Connect all three pillars:
User Report: "My order page is slow"
↓
Metrics: Which endpoint is slow?
→ /api/orders has p99 latency of 2s (usually 200ms)
↓
Logs: When did this start?
→ High latency started at 10:15 AM
→ Errors from product service increased
↓
Traces: Which requests are slow?
→ Request abc123 took 2.5s
→ Product service call took 2.3s
↓
Logs: What happened in product service?
→ Database query took 2.2s
→ Query: SELECT * FROM products WHERE id=789
↓
Metrics: Is database overloaded?
→ Database CPU is normal
→ Slow query count increased
↓
Root Cause: Query missing index on products table
This investigation took 5 minutes. Without observability, it would take hours.
Practical Implementation Patterns
Pattern 1: Context Propagation
Pass context through the entire request chain:
// HTTP handler receives context
func handleOrder(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
// Context flows to business logic
order, err := getOrder(ctx, orderID)
if err != nil {
log.WithContext(ctx).Error("Failed to get order", err)
return
}
// Context flows to database
products, err := getProducts(ctx, order.ProductIDs)
if err != nil {
log.WithContext(ctx).Error("Failed to get products", err)
return
}
// Context flows to downstream services
user, err := getUserService(ctx, order.UserID)
if err != nil {
log.WithContext(ctx).Error("Failed to get user", err)
return
}
}
// Every function accepts context
func getOrder(ctx context.Context, orderID string) (*Order, error) {
span, ctx := opentracing.StartSpanFromContext(ctx, "get_order")
defer span.Finish()
log := logrus.WithContext(ctx).WithField("order_id", orderID)
log.Info("Fetching order")
// Context flows to database
return db.GetOrder(ctx, orderID)
}
Context carries:
- Request ID
- Trace span
- User information
- Timeouts and cancellation
Everything is automatically correlated.
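A rough sketch of how those pieces end up on a single context (a hypothetical helper; string keys mirror the examples above, though typed keys are safer in production):

import (
	"context"
	"net/http"
	"time"
)

// buildRequestContext attaches correlation data and a deadline to the
// request context before it flows into business logic and downstream calls.
func buildRequestContext(r *http.Request, requestID, userID string) (context.Context, context.CancelFunc) {
	ctx := context.WithValue(r.Context(), "request_id", requestID)
	ctx = context.WithValue(ctx, "user_id", userID)
	// Cancellation and the deadline propagate automatically to every
	// function that accepts this ctx
	return context.WithTimeout(ctx, 2*time.Second)
}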
Pattern 2: Structured Logging with Context
Every log entry has rich context:
type Logger struct {
base *logrus.Entry
}
func (l *Logger) WithContext(ctx context.Context) *logrus.Entry {
	entry := l.base
	// Add request ID
	if requestID := ctx.Value("request_id"); requestID != nil {
		entry = entry.WithField("request_id", requestID)
	}
	// Add trace ID (comma-ok assertion: the span may not be a Jaeger span)
	if span := opentracing.SpanFromContext(ctx); span != nil {
		if sc, ok := span.Context().(jaeger.SpanContext); ok {
			entry = entry.WithField("trace_id", sc.TraceID().String())
		}
	}
	// Add user ID
	if userID := ctx.Value("user_id"); userID != nil {
		entry = entry.WithField("user_id", userID)
	}
	return entry
}
// Usage
log.WithContext(ctx).WithFields(logrus.Fields{
"order_id": orderID,
"amount": order.Amount,
}).Info("Order created")
// Output:
// {
// "level": "info",
// "msg": "Order created",
// "request_id": "abc123",
// "trace_id": "def456",
// "user_id": "789",
// "order_id": "ord_001",
// "amount": 99.99,
// "timestamp": "2016-09-22T10:15:30Z"
// }
Pattern 3: Metrics by Dimension
Tag metrics with dimensions for drill-down:
// Record metrics with tags
metrics.Histogram("api.latency", duration, map[string]string{
"endpoint": r.URL.Path,
"method": r.Method,
"status": strconv.Itoa(status),
"service": "order-service",
"version": appVersion,
})
metrics.Counter("api.errors", 1, map[string]string{
"endpoint": r.URL.Path,
"error_type": errorType,
"service": "order-service",
})
Query by dimension:
# Overall latency
api.latency
# Latency by endpoint
api.latency{endpoint="/api/orders"}
# Latency by endpoint and status
api.latency{endpoint="/api/orders",status="500"}
# Error rate by type
sum(api.errors) by (error_type)
This enables slicing and dicing data for root cause analysis.
Pattern 4: Health Checks with Details
Don’t just return 200/500. Provide details:
type HealthStatus struct {
Status string `json:"status"`
Version string `json:"version"`
Uptime string `json:"uptime"`
Checks map[string]CheckResult `json:"checks"`
}
type CheckResult struct {
Status string `json:"status"`
Message string `json:"message,omitempty"`
Latency string `json:"latency,omitempty"`
}
func healthHandler(w http.ResponseWriter, r *http.Request) {
health := HealthStatus{
Status: "healthy",
Version: appVersion,
Uptime: time.Since(startTime).String(),
Checks: make(map[string]CheckResult),
}
// Check database
dbStart := time.Now()
if err := db.Ping(); err != nil {
health.Checks["database"] = CheckResult{
Status: "unhealthy",
Message: err.Error(),
}
health.Status = "unhealthy"
} else {
health.Checks["database"] = CheckResult{
Status: "healthy",
Latency: time.Since(dbStart).String(),
}
}
// Check Redis
redisStart := time.Now()
if err := redis.Ping().Err(); err != nil {
health.Checks["redis"] = CheckResult{
Status: "unhealthy",
Message: err.Error(),
}
health.Status = "unhealthy"
} else {
health.Checks["redis"] = CheckResult{
Status: "healthy",
Latency: time.Since(redisStart).String(),
}
}
// Check downstream services
serviceStart := time.Now()
if err := checkService("http://product-service/health"); err != nil {
health.Checks["product-service"] = CheckResult{
Status: "unhealthy",
Message: err.Error(),
}
health.Status = "degraded" // Still functional but degraded
} else {
health.Checks["product-service"] = CheckResult{
Status: "healthy",
Latency: time.Since(serviceStart).String(),
}
}
status := http.StatusOK
if health.Status == "unhealthy" {
status = http.StatusServiceUnavailable
}
w.WriteHeader(status)
json.NewEncoder(w).Encode(health)
}
Detailed health checks help diagnose cascading failures.
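For reference, a response from this endpoint might look like the following (all values are purely illustrative):

{
  "status": "degraded",
  "version": "1.4.2",
  "uptime": "72h14m9s",
  "checks": {
    "database": { "status": "healthy", "latency": "2ms" },
    "redis": { "status": "healthy", "latency": "1ms" },
    "product-service": { "status": "unhealthy", "message": "connection timeout" }
  }
}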
Distributed System Challenges
Challenge 1: Cascading Failures
One slow service causes everything to slow down:
[Service A] → [Service B] → [Service C]
↓ (slow)
Database
Service C is slow. Service B waits. Service A waits. Everything is slow.
Solution: Circuit Breakers
type CircuitBreaker struct {
	mu           sync.Mutex // guards the fields below across concurrent requests
	maxFailures  int
	resetTimeout time.Duration
	state        string // "closed", "open", "half-open"
	failures     int
	lastFailTime time.Time
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.state == "open" {
		if time.Since(cb.lastFailTime) > cb.resetTimeout {
			cb.state = "half-open"
		} else {
			cb.mu.Unlock()
			return errors.New("circuit breaker open")
		}
	}
	cb.mu.Unlock()

	// Run the protected call without holding the lock
	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailTime = time.Now()
		if cb.failures >= cb.maxFailures {
			cb.state = "open"
			log.WithField("failures", cb.failures).Warn("Circuit breaker opened")
		}
		return err
	}
	if cb.state == "half-open" {
		cb.state = "closed"
		cb.failures = 0
		log.Info("Circuit breaker closed")
	}
	return nil
}
Track circuit breaker state as metrics:
metrics.Gauge("circuit_breaker.state", stateValue, map[string]string{
"service": serviceName,
})
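The stateValue above has to be numeric, so the state string needs an explicit encoding — a sketch (the mapping is arbitrary, as long as dashboards use the same one):

// stateToValue encodes the breaker state for the gauge:
// 0 = closed (healthy), 1 = half-open, 2 = open (failing fast)
func stateToValue(state string) float64 {
	switch state {
	case "closed":
		return 0
	case "half-open":
		return 1
	case "open":
		return 2
	default:
		return -1 // unknown state
	}
}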
Challenge 2: Sampling in High-Volume Systems
Tracing every request is expensive at scale.
Solution: Smart Sampling
type Sampler struct {
baseRate float64
errorRate float64
slowRate float64
}
func (s *Sampler) ShouldSample(r *http.Request, duration time.Duration, err error) bool {
// Always sample errors
if err != nil {
return rand.Float64() < s.errorRate
}
// Always sample slow requests
if duration > 1*time.Second {
return rand.Float64() < s.slowRate
}
// Sample normal requests at base rate
return rand.Float64() < s.baseRate
}
// Usage
sampler := &Sampler{
baseRate: 0.01, // 1% of normal requests
errorRate: 1.0, // 100% of errors
slowRate: 1.0, // 100% of slow requests
}
This captures interesting requests while reducing data volume.
Challenge 3: Clock Skew
Different servers have different times. Trace timings can be wrong.
Solution: Relative Timestamps
type Span struct {
TraceID string
SpanID string
ParentID string
StartTime time.Time
Duration time.Duration // Relative to start, not wall clock
}
Use duration from span start, not absolute timestamps. This eliminates clock skew issues.
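In practice that means finishing a span with the local monotonic clock rather than comparing wall-clock timestamps from different hosts — a sketch extending the struct above:

// finish records how long the span took using this host's clock only.
// time.Since reads Go's monotonic clock, so the duration is unaffected
// by wall-clock adjustments or skew between machines.
func (s *Span) finish() {
	s.Duration = time.Since(s.StartTime)
}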
Challenge 4: Cardinality Explosion
Too many metric dimensions creates millions of time series:
// Bad: High cardinality
metrics.Counter("api.requests", 1, map[string]string{
"endpoint": r.URL.Path, // 100 values
"user_id": userID, // 1,000,000 values ⚠️
"session": sessionID, // Infinite values ⚠️
})
Solution: Limit dimensions
// Good: Bounded cardinality
metrics.Counter("api.requests", 1, map[string]string{
"endpoint": r.URL.Path,
"user_tier": getUserTier(userID), // "free", "pro", "enterprise"
"status": strconv.Itoa(status),
})
Use logs/traces for high-cardinality data (user IDs, session IDs).
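A sketch of where the high-cardinality identifiers go instead — onto the trace span and the structured log entry for that single request (userID and sessionID are assumed to be in scope):

// Attach user/session identifiers to the span and the log entry,
// not to metric labels
if span := opentracing.SpanFromContext(ctx); span != nil {
	span.SetTag("user_id", userID)
	span.SetTag("session_id", sessionID)
}

log.WithFields(logrus.Fields{
	"user_id":    userID,
	"session_id": sessionID,
}).Info("Request handled")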
Building Dashboards
Metrics are useless without visualization.
Service Dashboard Template
Every service gets a standard dashboard:
# Grafana dashboard
dashboard:
  title: "Order Service"
  panels:
    - title: "Request Rate"
      metric: "sum(rate(http_requests_total[5m])) by (endpoint)"
    - title: "Error Rate"
      metric: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
    - title: "Latency (p50, p95, p99)"
      metrics:
        - "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        - "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        - "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    - title: "Dependency Latency"
      metric: "histogram_quantile(0.95, sum(rate(dependency_call_duration_seconds_bucket[5m])) by (le, service))"
    - title: "Circuit Breaker State"
      metric: "max(circuit_breaker_state) by (service)"
Every engineer can diagnose their service with the same dashboard layout.
Alerting Best Practices
Alerts should be actionable and non-noisy.
Symptom-Based Alerting
Alert on user-facing symptoms, not internal metrics:
# Bad: Internal metric
- alert: HighCPU
expr: cpu_usage > 80
message: "CPU usage is high"
# So what? Does this affect users?
# Good: User-facing symptom
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
message: "Error rate is {{ $value | humanizePercentage }}, check https://trace.example.com"
severity: critical
SLO-Based Alerting
Alert when SLOs are at risk:
- alert: LatencySLOBreach
expr: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
for: 5m
message: "p95 latency {{ $value }}s exceeds SLO of 500ms"
severity: warning
- alert: AvailabilitySLOBreach
expr: |
1 - (sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))) < 0.995
for: 10m
message: "Availability {{ $value | humanizePercentage }} below SLO of 99.5%"
severity: critical
Tools We Use
Metrics:
- Prometheus (collection and storage)
- Grafana (visualization)
- StatsD (client-side)
Logs:
- Elasticsearch (storage and search)
- Logstash/Fluentd (aggregation)
- Kibana (visualization)
Traces:
- Jaeger (distributed tracing)
- Zipkin (alternative to Jaeger)
All-in-One (Alternative):
- Datadog
- New Relic
- Honeycomb
We use open-source tools, but commercial options provide tighter integration.
Practical Rollout Strategy
Phase 1: Instrumentation (Week 1-2)
- Add structured logging to all services
- Instrument HTTP handlers with metrics
- Add health check endpoints
Phase 2: Collection (Week 3-4)
- Deploy Prometheus for metrics
- Deploy ELK stack for logs
- Set up Grafana dashboards
Phase 3: Correlation (Week 5-6)
- Add request ID propagation
- Implement distributed tracing
- Connect metrics, logs, and traces
Phase 4: Alerting (Week 7-8)
- Define SLOs
- Create symptom-based alerts
- Set up on-call rotation
Phase 5: Culture (Ongoing)
- Train engineers on observability tools
- Include observability in code reviews
- Use observability data in post-mortems
Conclusion
Observability is essential for distributed systems. Without it, you’re flying blind.
The key principles:
- Instrument everything: Metrics, logs, and traces
- Correlate data: Request IDs tie everything together
- Focus on symptoms: Alert on user-facing issues
- Make it self-service: Engineers debug their own services
- Continuous improvement: Use observability data to improve the system
Start small:
- Add structured logging
- Instrument HTTP handlers
- Set up basic dashboards
- Add distributed tracing
- Iterate and improve
Observability isn’t a project that finishes—it’s an ongoing practice.
In my next post, I’ll cover GDPR compliance preparation for engineers. Privacy regulations are coming, and engineering teams need to prepare now.
Good observability turns unknown unknowns into known knowns. Invest in it early.