I’ve debugged enough production incidents to know that having metrics alone, or logs alone, or traces alone isn’t enough. The real power comes from connecting these three pillars of observability. When you can jump from a latency spike in your metrics to the specific traces showing slow requests, and from there to the logs revealing the root cause, you can resolve incidents quickly.

Today, I want to share how to build unified observability that connects metrics, logs, and traces into a coherent system for understanding your infrastructure.

The Three Pillars

Before we unify them, let’s clarify what each pillar provides:

Metrics: Aggregated numerical data over time. “What’s my 95th percentile latency?” “How many errors per second?”

Logs: Discrete event records. “What happened in this request?” “What was the error message?”

Traces: Request flow through distributed systems. “Which service is slow?” “Where did this request spend its time?”

Each answers different questions. The key is linking them so you can navigate between perspectives.

Unified Context with Correlation IDs

The foundation of unified observability is correlation: connecting metrics, logs, and traces that belong to the same request.

type RequestContext struct {
    RequestID  string    // Unique ID for this request
    TraceID    string    // Distributed trace ID
    SpanID     string    // Current span ID
    UserID     string    // User making the request
    SessionID  string    // User session
    Timestamp  time.Time
}

// Middleware to create and propagate context
func ContextMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx := &RequestContext{
            RequestID: r.Header.Get("X-Request-ID"),
            TraceID:   r.Header.Get("X-Trace-ID"),
            UserID:    getUserID(r),
            SessionID: getSessionID(r),
            Timestamp: time.Now(),
        }

        // Generate IDs if not present
        if ctx.RequestID == "" {
            ctx.RequestID = generateID()
        }
        if ctx.TraceID == "" {
            ctx.TraceID = generateID()
        }

        // Add to the request context (a raw string key keeps the example short;
        // production code would use an unexported typed key to avoid collisions)
        r = r.WithContext(context.WithValue(r.Context(), "request_context", ctx))

        // Add to response headers for client-side correlation
        w.Header().Set("X-Request-ID", ctx.RequestID)

        next.ServeHTTP(w, r)
    })
}

Now every log, metric, and trace can reference the same request ID.
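
The middleware leans on a couple of helpers I haven’t shown. Here’s a minimal sketch of what they might look like, assuming crypto/rand-based IDs and the same string context key used above:

// getRequestContext pulls the RequestContext back out of a context.Context.
// It returns nil if the middleware didn't run for this request.
func getRequestContext(ctx context.Context) *RequestContext {
    if reqCtx, ok := ctx.Value("request_context").(*RequestContext); ok {
        return reqCtx
    }
    return nil
}

// generateID returns a random 16-byte hex string (uses crypto/rand and encoding/hex).
func generateID() string {
    b := make([]byte, 16)
    if _, err := rand.Read(b); err != nil {
        return "unknown"
    }
    return hex.EncodeToString(b)
}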

Structured Logging with Context

Traditional logs are unstructured text, making them hard to query and correlate. Use structured logging:

type StructuredLogger struct {
    logger *log.Logger
}

type LogEntry struct {
    Timestamp   time.Time              `json:"timestamp"`
    Level       string                 `json:"level"`
    Message     string                 `json:"message"`
    RequestID   string                 `json:"request_id,omitempty"`
    TraceID     string                 `json:"trace_id,omitempty"`
    SpanID      string                 `json:"span_id,omitempty"`
    UserID      string                 `json:"user_id,omitempty"`
    Service     string                 `json:"service"`
    Fields      map[string]interface{} `json:"fields,omitempty"`
}

func (l *StructuredLogger) Info(ctx context.Context, message string, fields map[string]interface{}) {
    entry := l.buildEntry(ctx, "INFO", message, fields)
    l.logger.Println(entry.ToJSON())
}

func (l *StructuredLogger) Error(ctx context.Context, message string, err error, fields map[string]interface{}) {
    if fields == nil {
        fields = make(map[string]interface{})
    }
    fields["error"] = err.Error()

    entry := l.buildEntry(ctx, "ERROR", message, fields)
    l.logger.Println(entry.ToJSON())
}

func (l *StructuredLogger) buildEntry(ctx context.Context, level, message string, fields map[string]interface{}) *LogEntry {
    entry := &LogEntry{
        Timestamp: time.Now(),
        Level:     level,
        Message:   message,
        Service:   "order-service",
        Fields:    fields,
    }

    // Extract context if available
    if reqCtx := getRequestContext(ctx); reqCtx != nil {
        entry.RequestID = reqCtx.RequestID
        entry.TraceID = reqCtx.TraceID
        entry.UserID = reqCtx.UserID
    }

    // Extract span context from OpenTracing
    if span := opentracing.SpanFromContext(ctx); span != nil {
        entry.SpanID = getSpanID(span)
        entry.TraceID = getTraceID(span)
    }

    return entry
}

func (e *LogEntry) ToJSON() string {
    data, _ := json.Marshal(e)
    return string(data)
}
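
buildEntry also relies on two small helpers to pull IDs out of the active span. The exact calls depend on your tracer; here’s a sketch assuming the Jaeger client (github.com/uber/jaeger-client-go, imported as jaeger):

// These assume a Jaeger-backed OpenTracing tracer; other tracers expose
// trace and span IDs through their own concrete SpanContext types.
func getTraceID(span opentracing.Span) string {
    if sc, ok := span.Context().(jaeger.SpanContext); ok {
        return sc.TraceID().String()
    }
    return ""
}

func getSpanID(span opentracing.Span) string {
    if sc, ok := span.Context().(jaeger.SpanContext); ok {
        return sc.SpanID().String()
    }
    return ""
}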

Now your logs contain all the correlation IDs needed to link them to traces and metrics:

{
  "timestamp": "2017-09-20T14:23:45Z",
  "level": "ERROR",
  "message": "Payment processing failed",
  "request_id": "req-abc-123",
  "trace_id": "trace-xyz-789",
  "span_id": "span-def-456",
  "user_id": "user-12345",
  "service": "order-service",
  "fields": {
    "error": "connection timeout",
    "payment_amount": 99.99,
    "retry_attempt": 3
  }
}

Metrics with Labels

Metrics need labels (tags) that allow correlation:

type MetricsCollector struct {
    registry *prometheus.Registry
}

func (m *MetricsCollector) RecordRequest(ctx context.Context, r *http.Request, duration time.Duration, statusCode int) {
    reqCtx := getRequestContext(ctx)

    // Record latency histogram with labels
    requestDuration.WithLabelValues(
        "order-service",           // service
        r.Method,                  // method
        r.URL.Path,                // endpoint
        strconv.Itoa(statusCode),  // status_code
        reqCtx.UserID,             // user_id (for per-user analysis)
    ).Observe(duration.Seconds())

    // Increment request counter
    requestTotal.WithLabelValues(
        "order-service",
        r.Method,
        r.URL.Path,
        strconv.Itoa(statusCode),
    ).Inc()

    // If this is an error, record an error metric tagged with the trace ID.
    // (Per-request label values are high-cardinality; exemplars, covered
    // below, are the lighter-weight way to attach trace IDs.)
    if statusCode >= 500 {
        errorTotal.WithLabelValues(
            "order-service",
            strconv.Itoa(statusCode),
            reqCtx.TraceID,  // Include trace ID for correlation
        ).Inc()
    }
}

When you see an error spike in your metrics, the trace IDs let you find the specific failing requests.
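
The collector above assumes the histogram and counters are declared and registered elsewhere. A minimal sketch of those definitions, with illustrative names and the default registry for brevity:

var (
    requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency distribution.",
        Buckets: prometheus.DefBuckets,
    }, []string{"service", "method", "endpoint", "status_code", "user_id"})

    requestTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests.",
    }, []string{"service", "method", "endpoint", "status_code"})

    errorTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_request_errors_total",
        Help: "Total number of HTTP 5xx responses.",
    }, []string{"service", "status_code", "trace_id"})
)

func init() {
    prometheus.MustRegister(requestDuration, requestTotal, errorTotal)
}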

Connecting Traces to Logs

Emit logs within traced operations:

func processOrder(ctx context.Context, order *Order) error {
    span, ctx := opentracing.StartSpanFromContext(ctx, "process_order")
    defer span.Finish()

    logger.Info(ctx, "Processing order", map[string]interface{}{
        "order_id": order.ID,
        "amount":   order.Total,
    })

    // Validate order
    if err := validateOrder(ctx, order); err != nil {
        logger.Error(ctx, "Order validation failed", err, map[string]interface{}{
            "order_id": order.ID,
        })
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
        return err
    }

    // Process payment
    if err := processPayment(ctx, order); err != nil {
        logger.Error(ctx, "Payment processing failed", err, map[string]interface{}{
            "order_id": order.ID,
            "amount":   order.Total,
        })
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
        return err
    }

    logger.Info(ctx, "Order processed successfully", map[string]interface{}{
        "order_id": order.ID,
    })

    return nil
}

Now your logs have the trace ID and span ID. In your tracing UI, you can jump directly to relevant logs.
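
For completeness, here’s how a handler might wire this together once ContextMiddleware is in place; parseOrder is a hypothetical request-body parser:

func handleCreateOrder(w http.ResponseWriter, r *http.Request) {
    // r.Context() already carries the IDs set by ContextMiddleware, so the
    // span and logs inside processOrder are automatically correlated.
    order, err := parseOrder(r)
    if err != nil {
        http.Error(w, "invalid order", http.StatusBadRequest)
        return
    }

    if err := processOrder(r.Context(), order); err != nil {
        http.Error(w, "order processing failed", http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusCreated)
}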

Exemplars: Linking Metrics to Traces

Exemplars attach trace IDs to metric samples:

func recordLatencyWithExemplar(ctx context.Context, duration time.Duration) {
    reqCtx := getRequestContext(ctx)

    // For slow requests, attach the trace ID to the observation as an exemplar
    // so the metric sample links back to a specific trace. This requires a
    // recent Prometheus client and the OpenMetrics exposition format.
    if reqCtx != nil && duration > 500*time.Millisecond {
        if eo, ok := latencyHistogram.(prometheus.ExemplarObserver); ok {
            eo.ObserveWithExemplar(duration.Seconds(), prometheus.Labels{
                "trace_id": reqCtx.TraceID,
            })
            return
        }
    }

    // Fast requests (or histograms without exemplar support) are recorded normally
    latencyHistogram.Observe(duration.Seconds())
}

In your metrics dashboard, when you see a latency spike, exemplars show you specific trace IDs that contributed to it.
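
If you expose metrics in the OpenMetrics format (exemplars aren’t part of the classic text format), the exemplar shows up next to the bucket the observation landed in. The values here are illustrative:

http_request_duration_seconds_bucket{le="1.0"} 42 # {trace_id="trace-xyz-789"} 0.85 1505916225.123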

Building a Unified Query Interface

With correlation in place, build UIs that let you navigate between pillars:

type ObservabilityQuery struct {
    metricsClient MetricsClient
    logsClient    LogsClient
    tracesClient  TracesClient
}

// Find all data related to a request
func (q *ObservabilityQuery) GetRequestObservability(requestID string) (*RequestObservability, error) {
    result := &RequestObservability{
        RequestID: requestID,
    }

    // Get logs for this request
    logs, err := q.logsClient.Query(LogQuery{
        Field: "request_id",
        Value: requestID,
    })
    if err != nil {
        return nil, err
    }
    result.Logs = logs

    // Extract trace ID from logs
    if len(logs) > 0 {
        traceID := logs[0].TraceID

        // Get distributed trace
        trace, err := q.tracesClient.GetTrace(traceID)
        if err != nil {
            return nil, err
        }
        result.Trace = trace

        // Get metrics for the time window of this request
        start := logs[0].Timestamp.Add(-1 * time.Minute)
        end := logs[0].Timestamp.Add(1 * time.Minute)

        metrics, err := q.metricsClient.QueryRange(MetricsQuery{
            Query: `rate(http_requests_total[1m])`,
            Start: start,
            End:   end,
        })
        if err != nil {
            return nil, err
        }
        result.Metrics = metrics
    }

    return result, nil
}

type RequestObservability struct {
    RequestID string
    Logs      []LogEntry
    Trace     *Trace
    Metrics   []MetricSample
}

This gives you a single API to get all observability data for a request.
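
A thin HTTP wrapper makes this queryable from a debugging UI or chat-ops tooling. The route and error handling here are just a sketch:

func (q *ObservabilityQuery) HandleRequestLookup(w http.ResponseWriter, r *http.Request) {
    requestID := r.URL.Query().Get("request_id")
    if requestID == "" {
        http.Error(w, "request_id is required", http.StatusBadRequest)
        return
    }

    result, err := q.GetRequestObservability(requestID)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(result)
}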

Anomaly Detection Across Pillars

Use metrics to detect anomalies, then automatically pull traces and logs:

type AnomalyDetector struct {
    metricsClient MetricsClient
    tracesClient  TracesClient
    logsClient    LogsClient
    alerter       Alerter
}

func (d *AnomalyDetector) DetectLatencySpikes() error {
    // Get current p95 latency
    current, err := d.metricsClient.Query(`
        histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
        )
    `)
    if err != nil {
        return err
    }

    // Compare to historical baseline
    baseline, err := d.metricsClient.Query(`
        histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[1h] offset 1d)
        )
    `)
    if err != nil {
        return err
    }

    // If latency is 2x normal, investigate
    if current > baseline*2 {
        d.investigateLatencySpike(current, baseline)
    }

    return nil
}

func (d *AnomalyDetector) investigateLatencySpike(current, baseline float64) {
    // Find slow traces from the last 5 minutes
    traces, err := d.tracesClient.FindTraces(TraceQuery{
        Service:       "order-service",
        MinDuration:   time.Duration(current * float64(time.Second)),
        LookbackTime:  5 * time.Minute,
        Limit:         10,
    })
    if err != nil {
        log.Printf("Failed to fetch traces: %v", err)
        return
    }

    // Get logs for these slow requests
    var traceIDs []string
    for _, trace := range traces {
        traceIDs = append(traceIDs, trace.TraceID)
    }

    logs, err := d.logsClient.Query(LogQuery{
        Field:  "trace_id",
        Values: traceIDs,
        Level:  "ERROR",
    })
    if err != nil {
        log.Printf("Failed to fetch logs: %v", err)
        return
    }

    // Analyze common patterns in slow requests
    commonErrors := d.analyzeErrorPatterns(logs)

    // Send alert with context (cap example trace IDs to avoid a slice
    // out-of-range panic when fewer than three traces were found)
    exampleTraces := traceIDs
    if len(exampleTraces) > 3 {
        exampleTraces = exampleTraces[:3]
    }

    d.alerter.Send(Alert{
        Title:    "Latency Spike Detected",
        Severity: "HIGH",
        Details: fmt.Sprintf(
            "P95 latency: %.2fs (baseline: %.2fs)\nCommon errors: %v\nExample traces: %v",
            current, baseline, commonErrors, exampleTraces,
        ),
        Links: map[string]string{
            "dashboard": "https://grafana.example.com/...",
            "traces":    "https://jaeger.example.com/...",
            "logs":      "https://kibana.example.com/...",
        },
    })
}

This automates the investigation process, immediately surfacing relevant traces and logs when metrics show an anomaly.
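
Something has to run the detector on a schedule. A minimal sketch using a ticker (the one-minute interval is arbitrary):

func (d *AnomalyDetector) Run(ctx context.Context) {
    ticker := time.NewTicker(1 * time.Minute)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := d.DetectLatencySpikes(); err != nil {
                log.Printf("latency spike detection failed: %v", err)
            }
        }
    }
}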

Distributed Context Propagation

Ensure context flows through message queues and async operations:

type Message struct {
    Payload   []byte
    Metadata  MessageMetadata
}

type MessageMetadata struct {
    RequestID   string
    TraceID     string
    SpanID      string
    Timestamp   time.Time
    SpanContext map[string]string // injected tracer headers (TextMap carrier)
}

// Producer: Inject context into message
func (p *Producer) PublishOrder(ctx context.Context, order *Order) error {
    reqCtx := getRequestContext(ctx)

    msg := &Message{
        Payload: serializeOrder(order),
        Metadata: MessageMetadata{
            RequestID: reqCtx.RequestID,
            TraceID:   reqCtx.TraceID,
            Timestamp: time.Now(),
        },
    }

    // Create a span for the async publish; StartSpanFromContext also handles
    // the case where ctx has no active span (it just starts a new root span)
    span, _ := opentracing.StartSpanFromContext(ctx, "publish_order_event")
    defer span.Finish()

    // Inject span context into message
    carrier := make(opentracing.TextMapCarrier)
    opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.TextMap,
        carrier,
    )
    msg.Metadata.SpanContext = carrier

    return p.queue.Publish("orders", msg)
}

// Consumer: Extract context from message
func (c *Consumer) ProcessOrder(msg *Message) error {
    // Extract the producer's span context; if extraction fails, wireContext is
    // nil and the ChildOf reference below is a no-op, so we start a new root span
    carrier := opentracing.TextMapCarrier(msg.Metadata.SpanContext)
    wireContext, _ := opentracing.GlobalTracer().Extract(
        opentracing.TextMap,
        carrier,
    )

    // Create a consumer span as a child of the producer's publish span
    span := opentracing.StartSpan(
        "process_order_event",
        ext.SpanKindConsumer,
        opentracing.ChildOf(wireContext),
    )
    defer span.Finish()

    ctx := opentracing.ContextWithSpan(context.Background(), span)

    // Reconstruct request context
    reqCtx := &RequestContext{
        RequestID: msg.Metadata.RequestID,
        TraceID:   msg.Metadata.TraceID,
    }
    ctx = context.WithValue(ctx, "request_context", reqCtx)

    // Process with full context
    logger.Info(ctx, "Processing order from queue", map[string]interface{}{
        "message_age": time.Since(msg.Metadata.Timestamp),
    })

    order, err := deserializeOrder(msg.Payload)
    if err != nil {
        logger.Error(ctx, "Failed to deserialize order", err, nil)
        return err
    }

    return processOrder(ctx, order)
}

Now async operations show up in your distributed traces, and logs from async processing have the same correlation IDs.
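
On the wire, the message metadata carries both your own correlation IDs and whatever headers the tracer injected. With a Jaeger tracer, a serialized message might look roughly like this (field names and the header format are illustrative):

{
  "payload": "...",
  "metadata": {
    "request_id": "req-abc-123",
    "trace_id": "trace-xyz-789",
    "timestamp": "2017-09-20T14:23:45Z",
    "span_context": {
      "uber-trace-id": "trace-xyz-789:span-def-456:0:1"
    }
  }
}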

Observability Dashboard

Build a unified dashboard that shows all three pillars:

┌───────────────────────────────────────────────────────┐
│  Metrics (Last 1 hour)                                │
│  ┌────────────────────────────────────────────────┐   │
│  │ Request Rate:  1,234 req/s  ↑ 5%               │   │
│  │ Error Rate:    0.2%         ↑ anomaly detected │   │
│  │ P95 Latency:   245ms        ↑ 2x baseline      │   │
│  └────────────────────────────────────────────────┘   │
├───────────────────────────────────────────────────────┤
│  Recent Errors (Logs)                                 │
│  ┌────────────────────────────────────────────────┐   │
│  │ 14:23:45 Payment timeout [trace: xyz-789] ─────┼─┐ │
│  │ 14:23:12 Database connection lost              │ │ │
│  │ 14:22:58 Invalid user token                    │ │ │
│  └────────────────────────────────────────────────┘ │ │
├─────────────────────────────────────────────────────┼─┤
│  Slow Traces                                        │ │
│  ┌────────────────────────────────────────────────┐ │ │
│  │ Trace xyz-789 (850ms) ◄────────────────────────┼─┘ │
│  │  ├─ order-service: 245ms                       │   │
│  │  ├─ payment-service: 580ms (timeout)           │   │
│  │  └─ notification-service: 25ms                 │   │
│  └────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────┘

Click an error in logs, jump to its trace. Click a spike in metrics, see example traces that contributed.

Best Practices

Consistent correlation IDs: Use the same ID scheme across all systems. Don’t have some services using UUIDs and others using integers.

Context propagation everywhere: Don’t break the chain. Every service call, message publish, and background job should propagate context.

Sampling coordination: Sample traces and logs together. If you sample 1% of traces, keep 100% of the logs for those sampled traces (see the sketch after this list).

Structured everything: Structured logs, labeled metrics, tagged traces. This enables correlation and querying.

Storage costs: Full observability generates a lot of data. Use sampling, retention policies, and aggregation to manage costs.

Alert correlation: When firing alerts, include links to relevant traces and logs, not just metric graphs.
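
Here’s the sampling-coordination sketch mentioned above, assuming a Jaeger-backed OpenTracing tracer whose concrete span context exposes IsSampled(); the Debug method is hypothetical:

// Debug emits verbose logs only for requests whose trace was sampled, so
// debug logs and traces stay correlated without logging every request.
func (l *StructuredLogger) Debug(ctx context.Context, message string, fields map[string]interface{}) {
    span := opentracing.SpanFromContext(ctx)
    if span == nil {
        return // no active trace: drop (or independently sample) debug logs
    }
    if sc, ok := span.Context().(jaeger.SpanContext); ok && sc.IsSampled() {
        entry := l.buildEntry(ctx, "DEBUG", message, fields)
        l.logger.Println(entry.ToJSON())
    }
}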

Looking Forward

The observability landscape is maturing. We’re seeing:

  • OpenTelemetry unifying tracing, metrics, and logs under one standard
  • Better automatic instrumentation requiring less manual code
  • AI/ML analyzing observability data to detect patterns
  • Cheaper storage making full retention more feasible

For modern distributed systems, unified observability isn’t optional; it’s essential. Metrics alone tell you something is wrong. Traces show you where. Logs explain why. Together, they give you the full picture.

Build correlation in from the start, instrument comprehensively, and create UIs that let you navigate between perspectives. When the 3 AM page comes, you’ll be glad you did.