I've debugged enough production incidents to know that having metrics alone, or logs alone, or traces alone isn't enough. The real power comes from connecting these three pillars of observability. When you can jump from a latency spike in your metrics, to the specific traces showing slow requests, to the logs revealing the root cause, that's when you can resolve incidents quickly.
Today, I want to share how to build unified observability that connects metrics, logs, and traces into a coherent system for understanding your infrastructure.
The Three Pillars
Before we unify them, let's clarify what each pillar provides:
Metrics: Aggregated numerical data over time. "What's my 95th percentile latency?" "How many errors per second?"
Logs: Discrete event records. "What happened in this request?" "What was the error message?"
Traces: Request flow through distributed systems. "Which service is slow?" "Where did this request spend its time?"
Each answers different questions. The key is linking them so you can navigate between perspectives.
Unified Context with Correlation IDs
The foundation of unified observability is correlation: connecting metrics, logs, and traces that belong to the same request.
type RequestContext struct {
    RequestID string // Unique ID for this request
    TraceID   string // Distributed trace ID
    SpanID    string // Current span ID
    UserID    string // User making the request
    SessionID string // User session
    Timestamp time.Time
}

// Middleware to create and propagate context
func ContextMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx := &RequestContext{
            RequestID: r.Header.Get("X-Request-ID"),
            TraceID:   r.Header.Get("X-Trace-ID"),
            UserID:    getUserID(r),
            SessionID: getSessionID(r),
            Timestamp: time.Now(),
        }

        // Generate IDs if not present
        if ctx.RequestID == "" {
            ctx.RequestID = generateID()
        }
        if ctx.TraceID == "" {
            ctx.TraceID = generateID()
        }

        // Add to request context
        r = r.WithContext(context.WithValue(r.Context(), "request_context", ctx))

        // Add to response headers for client-side correlation
        w.Header().Set("X-Request-ID", ctx.RequestID)

        next.ServeHTTP(w, r)
    })
}
Now every log, metric, and trace can reference the same request ID.
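The middleware leans on a few helpers (generateID, getRequestContext, getUserID, getSessionID) that aren't shown above. A minimal sketch of how they might look, assuming the github.com/google/uuid package for IDs and headers for user and session identity:

import (
    "context"
    "net/http"

    "github.com/google/uuid" // assumed UUID library
)

// generateID returns a new unique identifier; UUIDs are one reasonable choice.
func generateID() string {
    return uuid.New().String()
}

// getRequestContext pulls the RequestContext back out of a context.Context,
// returning nil if the middleware never ran.
func getRequestContext(ctx context.Context) *RequestContext {
    reqCtx, _ := ctx.Value("request_context").(*RequestContext)
    return reqCtx
}

// getUserID and getSessionID are placeholders; in practice these would come
// from your auth layer (e.g., a decoded token or session cookie).
func getUserID(r *http.Request) string    { return r.Header.Get("X-User-ID") }
func getSessionID(r *http.Request) string { return r.Header.Get("X-Session-ID") }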
Structured Logging with Context
Traditional logs are unstructured text, making them hard to query and correlate. Use structured logging:
type StructuredLogger struct {
    logger *log.Logger
}

type LogEntry struct {
    Timestamp time.Time              `json:"timestamp"`
    Level     string                 `json:"level"`
    Message   string                 `json:"message"`
    RequestID string                 `json:"request_id,omitempty"`
    TraceID   string                 `json:"trace_id,omitempty"`
    SpanID    string                 `json:"span_id,omitempty"`
    UserID    string                 `json:"user_id,omitempty"`
    Service   string                 `json:"service"`
    Fields    map[string]interface{} `json:"fields,omitempty"`
}

func (l *StructuredLogger) Info(ctx context.Context, message string, fields map[string]interface{}) {
    entry := l.buildEntry(ctx, "INFO", message, fields)
    l.logger.Println(entry.ToJSON())
}

func (l *StructuredLogger) Error(ctx context.Context, message string, err error, fields map[string]interface{}) {
    if fields == nil {
        fields = make(map[string]interface{})
    }
    fields["error"] = err.Error()
    entry := l.buildEntry(ctx, "ERROR", message, fields)
    l.logger.Println(entry.ToJSON())
}

func (l *StructuredLogger) buildEntry(ctx context.Context, level, message string, fields map[string]interface{}) *LogEntry {
    entry := &LogEntry{
        Timestamp: time.Now(),
        Level:     level,
        Message:   message,
        Service:   "order-service",
        Fields:    fields,
    }

    // Extract context if available
    if reqCtx := getRequestContext(ctx); reqCtx != nil {
        entry.RequestID = reqCtx.RequestID
        entry.TraceID = reqCtx.TraceID
        entry.UserID = reqCtx.UserID
    }

    // Extract span context from OpenTracing
    if span := opentracing.SpanFromContext(ctx); span != nil {
        entry.SpanID = getSpanID(span)
        entry.TraceID = getTraceID(span)
    }

    return entry
}

func (e *LogEntry) ToJSON() string {
    data, _ := json.Marshal(e)
    return string(data)
}
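The snippet doesn't show how the logger is constructed; one minimal way to wire it up, writing one JSON object per line to a writer such as os.Stdout:

import (
    "io"
    "log"
)

// NewStructuredLogger writes one JSON object per line to the given writer.
func NewStructuredLogger(w io.Writer) *StructuredLogger {
    // No prefix or flags: the LogEntry carries its own timestamp and level.
    return &StructuredLogger{logger: log.New(w, "", 0)}
}

One JSON object per line keeps the output easy for log shippers to parse and forward.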
Now your logs contain all the correlation IDs needed to link them to traces and metrics:
{
  "timestamp": "2017-09-20T14:23:45Z",
  "level": "ERROR",
  "message": "Payment processing failed",
  "request_id": "req-abc-123",
  "trace_id": "trace-xyz-789",
  "span_id": "span-def-456",
  "user_id": "user-12345",
  "service": "order-service",
  "fields": {
    "error": "connection timeout",
    "payment_amount": 99.99,
    "retry_attempt": 3
  }
}
Metrics with Labels
Metrics need labels (tags) that allow correlation:
type MetricsCollector struct {
    registry *prometheus.Registry
}

func (m *MetricsCollector) RecordRequest(ctx context.Context, r *http.Request, duration time.Duration, statusCode int) {
    reqCtx := getRequestContext(ctx)

    // Record latency histogram with labels
    requestDuration.WithLabelValues(
        "order-service",          // service
        r.Method,                 // method
        r.URL.Path,               // endpoint
        strconv.Itoa(statusCode), // status_code
        reqCtx.UserID,            // user_id (for per-user analysis)
    ).Observe(duration.Seconds())

    // Increment request counter
    requestTotal.WithLabelValues(
        "order-service",
        r.Method,
        r.URL.Path,
        strconv.Itoa(statusCode),
    ).Inc()

    // If this is an error, record error metric with trace_id
    if statusCode >= 500 {
        errorTotal.WithLabelValues(
            "order-service",
            strconv.Itoa(statusCode),
            reqCtx.TraceID, // Include trace ID for correlation
        ).Inc()
    }
}
When you see an error spike in your metrics, the trace IDs let you find the specific failing requests.
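The requestDuration, requestTotal, and errorTotal collectors referenced above aren't declared in the snippet. A minimal sketch of how they could be defined with prometheus/client_golang; the label sets mirror the WithLabelValues calls, while the errorTotal name and the bucket choice are assumptions:

import "github.com/prometheus/client_golang/prometheus"

var (
    requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency by service, method, endpoint, status, and user.",
        Buckets: prometheus.DefBuckets,
    }, []string{"service", "method", "endpoint", "status_code", "user_id"})

    requestTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total requests by service, method, endpoint, and status.",
    }, []string{"service", "method", "endpoint", "status_code"})

    // Caution: trace_id is effectively unbounded, so this label is only
    // manageable if 5xx responses are rare.
    errorTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_request_errors_total",
        Help: "Server errors by service and status, tagged with a trace ID.",
    }, []string{"service", "status_code", "trace_id"})
)

// Remember to register these at startup, e.g.
// registry.MustRegister(requestDuration, requestTotal, errorTotal)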
Connecting Traces to Logs
Emit logs within traced operations:
func processOrder(ctx context.Context, order *Order) error {
    span, ctx := opentracing.StartSpanFromContext(ctx, "process_order")
    defer span.Finish()

    logger.Info(ctx, "Processing order", map[string]interface{}{
        "order_id": order.ID,
        "amount":   order.Total,
    })

    // Validate order
    if err := validateOrder(ctx, order); err != nil {
        logger.Error(ctx, "Order validation failed", err, map[string]interface{}{
            "order_id": order.ID,
        })
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
        return err
    }

    // Process payment
    if err := processPayment(ctx, order); err != nil {
        logger.Error(ctx, "Payment processing failed", err, map[string]interface{}{
            "order_id": order.ID,
            "amount":   order.Total,
        })
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
        return err
    }

    logger.Info(ctx, "Order processed successfully", map[string]interface{}{
        "order_id": order.ID,
    })

    return nil
}
Now your logs have the trace ID and span ID. In your tracing UI, you can jump directly to relevant logs.
Exemplars: Linking Metrics to Traces
Exemplars attach trace IDs to metric samples:
func recordLatencyWithExemplar(ctx context.Context, duration time.Duration) {
    reqCtx := getRequestContext(ctx)

    // For slow requests, attach the trace ID as an exemplar so the metric
    // sample links back to a concrete trace. Exemplar support in
    // prometheus/client_golang is exposed via the ExemplarObserver interface.
    if duration > 500*time.Millisecond && reqCtx != nil {
        if eo, ok := latencyHistogram.(prometheus.ExemplarObserver); ok {
            eo.ObserveWithExemplar(duration.Seconds(), prometheus.Labels{
                "trace_id": reqCtx.TraceID,
            })
            return
        }
    }

    // Fast requests (or no exemplar support): record a plain observation.
    latencyHistogram.Observe(duration.Seconds())
}
In your metrics dashboard, when you see a latency spike, exemplars show you specific trace IDs that contributed to it.
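One wiring detail worth noting: with prometheus/client_golang, exemplars only appear on the /metrics endpoint when the OpenMetrics exposition format is enabled. A minimal sketch, assuming a custom registry:

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func metricsHandler(reg *prometheus.Registry) http.Handler {
    return promhttp.HandlerFor(reg, promhttp.HandlerOpts{
        // Exemplars are part of the OpenMetrics format, not the classic
        // Prometheus text format, so this must be enabled.
        EnableOpenMetrics: true,
    })
}

On the server side, Prometheus also needs exemplar storage turned on (it ships behind the exemplar-storage feature flag) before exemplars show up in dashboards.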
Building a Unified Query Interface
With correlation in place, build UIs that let you navigate between pillars:
type ObservabilityQuery struct {
    metricsClient MetricsClient
    logsClient    LogsClient
    tracesClient  TracesClient
}

// Find all data related to a request
func (q *ObservabilityQuery) GetRequestObservability(requestID string) (*RequestObservability, error) {
    result := &RequestObservability{
        RequestID: requestID,
    }

    // Get logs for this request
    logs, err := q.logsClient.Query(LogQuery{
        Field: "request_id",
        Value: requestID,
    })
    if err != nil {
        return nil, err
    }
    result.Logs = logs

    // Extract trace ID from logs
    if len(logs) > 0 {
        traceID := logs[0].TraceID

        // Get distributed trace
        trace, err := q.tracesClient.GetTrace(traceID)
        if err != nil {
            return nil, err
        }
        result.Trace = trace

        // Get metrics for the time window of this request
        start := logs[0].Timestamp.Add(-1 * time.Minute)
        end := logs[0].Timestamp.Add(1 * time.Minute)

        metrics, err := q.metricsClient.QueryRange(MetricsQuery{
            Query: `rate(http_requests_total[1m])`,
            Start: start,
            End:   end,
        })
        if err != nil {
            return nil, err
        }
        result.Metrics = metrics
    }

    return result, nil
}

type RequestObservability struct {
    RequestID string
    Logs      []LogEntry
    Trace     *Trace
    Metrics   []MetricSample
}
This gives you a single API to get all observability data for a request.
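How you expose that API is up to you; one option is a small internal debug endpoint. The handler name and route below are illustrative, not part of the code above:

import (
    "encoding/json"
    "net/http"
)

// debugRequestHandler serves the combined view for one request, e.g.
//   GET /debug/requests?request_id=req-abc-123
func debugRequestHandler(q *ObservabilityQuery) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        requestID := r.URL.Query().Get("request_id")
        if requestID == "" {
            http.Error(w, "request_id is required", http.StatusBadRequest)
            return
        }

        obs, err := q.GetRequestObservability(requestID)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(obs)
    }
}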
Anomaly Detection Across Pillars
Use metrics to detect anomalies, then automatically pull traces and logs:
type AnomalyDetector struct {
    metricsClient MetricsClient
    tracesClient  TracesClient
    logsClient    LogsClient
    alerter       Alerter
}

func (d *AnomalyDetector) DetectLatencySpikes() error {
    // Get current p95 latency
    current, err := d.metricsClient.Query(`
        histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
        )
    `)
    if err != nil {
        return err
    }

    // Compare to historical baseline
    baseline, err := d.metricsClient.Query(`
        histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[1h] offset 1d)
        )
    `)
    if err != nil {
        return err
    }

    // If latency is 2x normal, investigate
    if current > baseline*2 {
        d.investigateLatencySpike(current, baseline)
    }

    return nil
}
func (d *AnomalyDetector) investigateLatencySpike(current, baseline float64) {
    // Find slow traces from the last 5 minutes
    traces, err := d.tracesClient.FindTraces(TraceQuery{
        Service:      "order-service",
        MinDuration:  time.Duration(current * float64(time.Second)),
        LookbackTime: 5 * time.Minute,
        Limit:        10,
    })
    if err != nil {
        log.Printf("Failed to fetch traces: %v", err)
        return
    }

    // Get logs for these slow requests
    var traceIDs []string
    for _, trace := range traces {
        traceIDs = append(traceIDs, trace.TraceID)
    }

    logs, err := d.logsClient.Query(LogQuery{
        Field:  "trace_id",
        Values: traceIDs,
        Level:  "ERROR",
    })
    if err != nil {
        log.Printf("Failed to fetch logs: %v", err)
        return
    }

    // Analyze common patterns in slow requests
    commonErrors := d.analyzeErrorPatterns(logs)

    // Include up to three example trace IDs in the alert
    examples := traceIDs
    if len(examples) > 3 {
        examples = examples[:3]
    }

    // Send alert with context
    d.alerter.Send(Alert{
        Title:    "Latency Spike Detected",
        Severity: "HIGH",
        Details: fmt.Sprintf(
            "P95 latency: %.2fs (baseline: %.2fs)\nCommon errors: %v\nExample traces: %v",
            current, baseline, commonErrors, examples,
        ),
        Links: map[string]string{
            "dashboard": "https://grafana.example.com/...",
            "traces":    "https://jaeger.example.com/...",
            "logs":      "https://kibana.example.com/...",
        },
    })
}
This automates the investigation process, immediately surfacing relevant traces and logs when metrics show an anomaly.
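To make this continuous, the detector needs to run on a schedule. A minimal sketch using a ticker; the one-minute interval is arbitrary:

import (
    "context"
    "log"
    "time"
)

// RunDetectionLoop checks for latency anomalies once a minute until the
// context is cancelled.
func RunDetectionLoop(ctx context.Context, d *AnomalyDetector) {
    ticker := time.NewTicker(1 * time.Minute)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := d.DetectLatencySpikes(); err != nil {
                log.Printf("anomaly detection failed: %v", err)
            }
        }
    }
}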
Distributed Context Propagation
Ensure context flows through message queues and async operations:
type Message struct {
    Payload  []byte
    Metadata MessageMetadata
}

type MessageMetadata struct {
    RequestID   string
    TraceID     string
    SpanID      string
    SpanContext map[string]string // injected tracer context (text map)
    Timestamp   time.Time
}

// Producer: Inject context into message
func (p *Producer) PublishOrder(ctx context.Context, order *Order) error {
    reqCtx := getRequestContext(ctx)

    msg := &Message{
        Payload: serializeOrder(order),
        Metadata: MessageMetadata{
            RequestID: reqCtx.RequestID,
            TraceID:   reqCtx.TraceID,
            Timestamp: time.Now(),
        },
    }

    // Create span for async operation
    span := opentracing.StartSpan(
        "publish_order_event",
        opentracing.ChildOf(opentracing.SpanFromContext(ctx).Context()),
    )
    defer span.Finish()

    // Inject span context into message
    carrier := make(opentracing.TextMapCarrier)
    opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.TextMap,
        carrier,
    )
    msg.Metadata.SpanContext = carrier

    return p.queue.Publish("orders", msg)
}

// Consumer: Extract context from message
func (c *Consumer) ProcessOrder(msg *Message) error {
    // Extract span context
    carrier := opentracing.TextMapCarrier(msg.Metadata.SpanContext)
    wireContext, _ := opentracing.GlobalTracer().Extract(
        opentracing.TextMap,
        carrier,
    )

    // Create span as child of the published span
    span := opentracing.StartSpan(
        "process_order_event",
        ext.SpanKindRPCServer,
        opentracing.ChildOf(wireContext),
    )
    defer span.Finish()

    ctx := opentracing.ContextWithSpan(context.Background(), span)

    // Reconstruct request context
    reqCtx := &RequestContext{
        RequestID: msg.Metadata.RequestID,
        TraceID:   msg.Metadata.TraceID,
    }
    ctx = context.WithValue(ctx, "request_context", reqCtx)

    // Process with full context
    logger.Info(ctx, "Processing order from queue", map[string]interface{}{
        "message_age": time.Since(msg.Metadata.Timestamp),
    })

    order, err := deserializeOrder(msg.Payload)
    if err != nil {
        logger.Error(ctx, "Failed to deserialize order", err, nil)
        return err
    }

    return processOrder(ctx, order)
}
Now async operations show up in your distributed traces, and logs from async processing have the same correlation IDs.
Observability Dashboard
Build a unified dashboard that shows all three pillars:
+-- Metrics (Last 1 hour) ------------------------------------
|   Request Rate: 1,234 req/s     (up 5%)
|   Error Rate:   0.2%            (anomaly detected)
|   P95 Latency:  245ms           (2x baseline)
+-- Recent Errors (Logs) -------------------------------------
|   14:23:45  Payment timeout            [trace: xyz-789]
|   14:23:12  Database connection lost
|   14:22:58  Invalid user token
+-- Slow Traces ----------------------------------------------
|   Trace xyz-789 (850ms)   <- linked from the error above
|     +- order-service:       245ms
|     +- payment-service:     580ms (timeout)
|     +- notification-service: 25ms
+-------------------------------------------------------------
Click an error in logs, jump to its trace. Click a spike in metrics, see example traces that contributed.
Best Practices
Consistent correlation IDs: Use the same ID scheme across all systems. Don't have some services using UUIDs and others using integers.
Context propagation everywhere: Don't break the chain. Every service call, message publish, and background job should propagate context.
Sampling coordination: Sample traces and logs together. If you sample 1% of traces, sample 100% of logs for those traces (one way to check the sampling decision is sketched after this list).
Structured everything: Structured logs, labeled metrics, tagged traces. This enables correlation and querying.
Storage costs: Full observability generates a lot of data. Use sampling, retention policies, and aggregation to manage costs.
Alert correlation: When firing alerts, include links to relevant traces and logs, not just metric graphs.
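Back to the sampling point above: one way to coordinate log verbosity with trace sampling is to keep verbose logs only for requests whose trace was sampled. The check is tracer-specific; this sketch assumes Jaeger as the OpenTracing implementation:

import (
    "context"

    opentracing "github.com/opentracing/opentracing-go"
    jaeger "github.com/uber/jaeger-client-go" // assumed tracer implementation
)

// shouldKeepDebugLogs reports whether the current trace was sampled, so that
// verbose logs are kept for exactly the requests that also have a trace.
func shouldKeepDebugLogs(ctx context.Context) bool {
    span := opentracing.SpanFromContext(ctx)
    if span == nil {
        return false
    }
    sc, ok := span.Context().(jaeger.SpanContext)
    return ok && sc.IsSampled()
}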
Looking Forward
The observability landscape is maturing. We're seeing:
- OpenTelemetry unifying tracing, metrics, and logs under one standard
- Better automatic instrumentation requiring less manual code
- AI/ML analyzing observability data to detect patterns
- Cheaper storage making full retention more feasible
For modern distributed systems, unified observability isn't optional; it's essential. Metrics alone tell you something is wrong. Traces show you where. Logs explain why. Together, they give you the full picture.
Build correlation in from the start, instrument comprehensively, and create UIs that let you navigate between perspectives. When the 3 AM page comes, you'll be glad you did.