Our microservices architecture grew from 5 services to 40 services in a year. Deployments are faster, teams are autonomous, and we ship features more quickly.
But when something breaks, troubleshooting is a nightmare.
A user reports a slow request. Where do you start? The request touched 12 services, 3 databases, 2 message queues, and 5 third-party APIs. Traditional logging and monitoring don’t cut it anymore.
This is the observability problem in distributed systems.
Over the past six months, we’ve built observability into our architecture. Not just monitoring what we think might fail, but instrumenting systems to answer questions we haven’t thought of yet.
Here’s what observability actually means and how to build it.
Monitoring vs. Observability
First, let’s clarify terms. They’re not the same thing.
Monitoring is watching known failure modes:
- Is the CPU usage high?
- Is the disk full?
- Is the service responding to health checks?
You define metrics upfront. You set alerts for thresholds. This works great for known failure scenarios.
Observability is understanding system behavior by examining outputs:
- Why is this particular user’s request slow?
- Which service is causing latency spikes?
- How does this request flow through the system?
You don’t have to predict every failure mode up front. The system emits enough telemetry to investigate issues you never anticipated.
The difference:
Monitoring: "The API has high latency"
Observability: "The API is slow because the database query for user 12345's
orders is timing out due to a missing index on the orders table"
Monitoring tells you something is wrong. Observability tells you why.
The Three Pillars of Observability
Observability rests on three pillars: metrics, logs, and traces.
Pillar 1: Metrics
Metrics are numerical measurements over time:
- Request count
- Response time
- Error rate
- CPU usage
- Memory consumption
Metrics are cheap to collect and store. They’re perfect for dashboards and alerting.
// Instrumentation with Prometheus
import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency",
			Buckets: []float64{.001, .005, .01, .05, .1, .5, 1, 5},
		},
		[]string{"method", "endpoint", "status"},
	)
	requestCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
)

func init() {
	// Register the collectors so they are exported on /metrics
	prometheus.MustRegister(requestDuration, requestCount)
}

// responseWriter wraps http.ResponseWriter to capture the status code
type responseWriter struct {
	http.ResponseWriter
	status int
}

func (w *responseWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Wrap the response writer to capture the status code
		wrapper := &responseWriter{ResponseWriter: w, status: 200}
		handler(wrapper, r)
		duration := time.Since(start).Seconds()
		status := strconv.Itoa(wrapper.status)
		requestDuration.WithLabelValues(r.Method, r.URL.Path, status).Observe(duration)
		requestCount.WithLabelValues(r.Method, r.URL.Path, status).Inc()
	}
}
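To make these metrics scrapeable, the instrumented handler and the metrics endpoint need to be wired into the server. A minimal sketch, assuming the standard promhttp handler from the official client library (ordersHandler is an illustrative name, not part of the code above):

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Business endpoint, wrapped with the instrumentation above
	http.HandleFunc("/api/orders", instrumentHandler(ordersHandler))

	// Prometheus scrapes metrics from this endpoint
	http.Handle("/metrics", promhttp.Handler())

	http.ListenAndServe(":8080", nil)
}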
Key metrics for every service:
- RED metrics: Rate, Errors, Duration
- USE metrics: Utilization, Saturation, Errors
Metrics answer: “What is happening?”
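As a concrete illustration, the RED metrics fall straight out of the two Prometheus series defined above (these queries assume those metric names):

# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration: 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))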
Pillar 2: Logs
Logs are discrete events:
2016-09-22T10:15:30.456Z INFO Request started method=GET path=/api/users/123
2016-09-22T10:15:30.512Z INFO Database query executed query="SELECT * FROM users WHERE id=?" duration=45ms
2016-09-22T10:15:30.558Z ERROR Failed to fetch user user_id=123 error="connection timeout"
Use structured logging:
import "github.com/sirupsen/logrus"
log := logrus.WithFields(logrus.Fields{
"user_id": userID,
"request_id": requestID,
"service": "api",
})
log.Info("Processing user request")
// Later in the code
log.WithFields(logrus.Fields{
"query": query,
"duration": duration,
}).Info("Database query executed")
Structured logs are machine-readable. You can query them:
# Find all slow database queries
query="SELECT * FROM users WHERE id=?" AND duration > 100ms
# Find all errors for a specific user
user_id=123 AND level=ERROR
Logs answer: “What happened and what was the context?”
Pillar 3: Traces
Traces show request flow through distributed systems:
Request ID: abc123
[Frontend] GET /orders → 45ms
↓
[API Gateway] GET /api/orders → 820ms
↓
├─ [Order Service] GET /orders/user/123 → 780ms
│ ↓
│ ├─ [Database] SELECT orders → 15ms
│ ├─ [Product Service] GET /products/456 → 250ms
│ │ └─ [Cache] GET product:456 → 2ms
│ ├─ [Product Service] GET /products/789 → 500ms ⚠️ SLOW
│ │ └─ [Database] SELECT products → 485ms ⚠️ SLOW
│ └─ [User Service] GET /users/123 → 10ms
└─ [Analytics] track_event → 5ms (async)
Traces show exactly where time is spent. In this example, we see that /products/789 is slow because of a database query.
Implementing distributed tracing:
import (
	"log"

	"github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// Initialize tracer (the Configuration struct lives in the
// jaeger-client-go "config" subpackage)
func initTracer() opentracing.Tracer {
	cfg := jaegercfg.Configuration{
		ServiceName: "order-service",
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "const",
			Param: 1, // Sample all requests
		},
		Reporter: &jaegercfg.ReporterConfig{
			LogSpans:           true,
			LocalAgentHostPort: "jaeger:6831",
		},
	}
	tracer, _, err := cfg.NewTracer() // the returned closer should be flushed on shutdown
	if err != nil {
		log.Fatalf("failed to initialize tracer: %v", err)
	}
	return tracer
}
// Trace HTTP handlers
func traceHandler(handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
// Extract span context from headers
spanCtx, _ := opentracing.GlobalTracer().Extract(
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(r.Header),
)
// Start new span
span := opentracing.StartSpan(
r.URL.Path,
opentracing.ChildOf(spanCtx),
opentracing.Tag{Key: "http.method", Value: r.Method},
opentracing.Tag{Key: "http.url", Value: r.URL.String()},
)
defer span.Finish()
// Add span to context
ctx := opentracing.ContextWithSpan(r.Context(), span)
handler(w, r.WithContext(ctx))
}
}
// Trace database calls
func queryWithTracing(ctx context.Context, query string) (*sql.Rows, error) {
span, _ := opentracing.StartSpanFromContext(ctx, "database query")
span.SetTag("query", query)
defer span.Finish()
return db.QueryContext(ctx, query)
}
// Trace outbound HTTP calls
func httpCallWithTracing(ctx context.Context, url string) (*http.Response, error) {
span, ctx := opentracing.StartSpanFromContext(ctx, "http call")
span.SetTag("url", url)
defer span.Finish()
req, _ := http.NewRequest("GET", url, nil)
req = req.WithContext(ctx)
// Inject span context into headers
opentracing.GlobalTracer().Inject(
span.Context(),
opentracing.HTTPHeaders,
opentracing.HTTPHeadersCarrier(req.Header),
)
return http.DefaultClient.Do(req)
}
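Tying these helpers together at service startup — a minimal sketch (ordersHandler is a placeholder for whatever handler the service exposes):

func main() {
	// Register the tracer globally so traceHandler and the
	// StartSpanFromContext helpers above can find it
	tracer := initTracer()
	opentracing.SetGlobalTracer(tracer)

	http.HandleFunc("/orders", traceHandler(ordersHandler))
	http.ListenAndServe(":8080", nil)
}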
Traces answer: “How did this request flow through the system?”
Correlation: Tying It All Together
The real power comes from correlating metrics, logs, and traces.
Request IDs
Every request gets a unique ID that flows through all services:
func requestIDMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requestID := r.Header.Get("X-Request-ID")
if requestID == "" {
requestID = uuid.New().String()
}
// Add to response headers
w.Header().Set("X-Request-ID", requestID)
// Add to context
ctx := context.WithValue(r.Context(), "request_id", requestID)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
Propagate request ID to all downstream services:
func callDownstreamService(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	// Propagate request ID (comma-ok avoids a panic if no ID is on the context)
	if requestID, ok := ctx.Value("request_id").(string); ok {
		req.Header.Set("X-Request-ID", requestID)
	}
	return http.DefaultClient.Do(req)
}
Include request ID in all logs:
log := logrus.WithField("request_id", requestID)
log.Info("Processing request")
Now you can trace a request across all services:
# Find all logs for request abc123
request_id=abc123
# Find trace for request
trace_id=abc123
Unified Observability
Connect all three pillars:
User Report: "My order page is slow"
↓
Metrics: Which endpoint is slow?
→ /api/orders has p99 latency of 2s (usually 200ms)
↓
Logs: When did this start?
→ High latency started at 10:15 AM
→ Errors from product service increased
↓
Traces: Which requests are slow?
→ Request abc123 took 2.5s
→ Product service call took 2.3s
↓
Logs: What happened in product service?
→ Database query took 2.2s
→ Query: SELECT * FROM products WHERE id=789
↓
Metrics: Is database overloaded?
→ Database CPU is normal
→ Slow query count increased
↓
Root Cause: Query missing index on products table
This investigation took 5 minutes. Without observability, it would take hours.
Practical Implementation Patterns
Pattern 1: Context Propagation
Pass context through the entire request chain:
// HTTP handler receives context
func handleOrder(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
// Context flows to business logic
order, err := getOrder(ctx, orderID)
if err != nil {
log.WithContext(ctx).Error("Failed to get order", err)
return
}
// Context flows to database
products, err := getProducts(ctx, order.ProductIDs)
if err != nil {
log.WithContext(ctx).Error("Failed to get products", err)
return
}
// Context flows to downstream services
user, err := getUserService(ctx, order.UserID)
if err != nil {
log.WithContext(ctx).Error("Failed to get user", err)
return
}
}
// Every function accepts context
func getOrder(ctx context.Context, orderID string) (*Order, error) {
span, ctx := opentracing.StartSpanFromContext(ctx, "get_order")
defer span.Finish()
log := logrus.WithContext(ctx).WithField("order_id", orderID)
log.Info("Fetching order")
// Context flows to database
return db.GetOrder(ctx, orderID)
}
Context carries:
- Request ID
- Trace span
- User information
- Timeouts and cancellation
Everything is automatically correlated.
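A rough sketch of how those pieces end up on a single context (a hypothetical helper; string keys mirror the examples above, though typed keys are safer in production):

import (
	"context"
	"net/http"
	"time"
)

// buildRequestContext attaches correlation data and a deadline to the
// request context before it flows into business logic and downstream calls.
func buildRequestContext(r *http.Request, requestID, userID string) (context.Context, context.CancelFunc) {
	ctx := context.WithValue(r.Context(), "request_id", requestID)
	ctx = context.WithValue(ctx, "user_id", userID)
	// Cancellation and the deadline propagate automatically to every
	// function that accepts this ctx
	return context.WithTimeout(ctx, 2*time.Second)
}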
Pattern 2: Structured Logging with Context
Every log entry has rich context:
type Logger struct {
base *logrus.Entry
}
func (l *Logger) WithContext(ctx context.Context) *logrus.Entry {
	entry := l.base
	// Add request ID
	if requestID := ctx.Value("request_id"); requestID != nil {
		entry = entry.WithField("request_id", requestID)
	}
	// Add trace ID (comma-ok assertion: the span may not be a Jaeger span)
	if span := opentracing.SpanFromContext(ctx); span != nil {
		if sc, ok := span.Context().(jaeger.SpanContext); ok {
			entry = entry.WithField("trace_id", sc.TraceID().String())
		}
	}
	// Add user ID
	if userID := ctx.Value("user_id"); userID != nil {
		entry = entry.WithField("user_id", userID)
	}
	return entry
}
// Usage
log.WithContext(ctx).WithFields(logrus.Fields{
"order_id": orderID,
"amount": order.Amount,
}).Info("Order created")
// Output:
// {
// "level": "info",
// "msg": "Order created",
// "request_id": "abc123",
// "trace_id": "def456",
// "user_id": "789",
// "order_id": "ord_001",
// "amount": 99.99,
// "timestamp": "2016-09-22T10:15:30Z"
// }
Pattern 3: Metrics by Dimension
Tag metrics with dimensions for drill-down:
// Record metrics with tags
metrics.Histogram("api.latency", duration, map[string]string{
"endpoint": r.URL.Path,
"method": r.Method,
"status": strconv.Itoa(status),
"service": "order-service",
"version": appVersion,
})
metrics.Counter("api.errors", 1, map[string]string{
"endpoint": r.URL.Path,
"error_type": errorType,
"service": "order-service",
})
Query by dimension:
# Overall latency
api.latency
# Latency by endpoint
api.latency{endpoint="/api/orders"}
# Latency by endpoint and status
api.latency{endpoint="/api/orders",status="500"}
# Error rate by type
sum(api.errors) by (error_type)
This enables slicing and dicing data for root cause analysis.
Pattern 4: Health Checks with Details
Don’t just return 200/500. Provide details:
type HealthStatus struct {
Status string `json:"status"`
Version string `json:"version"`
Uptime string `json:"uptime"`
Checks map[string]CheckResult `json:"checks"`
}
type CheckResult struct {
Status string `json:"status"`
Message string `json:"message,omitempty"`
Latency string `json:"latency,omitempty"`
}
func healthHandler(w http.ResponseWriter, r *http.Request) {
health := HealthStatus{
Status: "healthy",
Version: appVersion,
Uptime: time.Since(startTime).String(),
Checks: make(map[string]CheckResult),
}
// Check database
dbStart := time.Now()
if err := db.Ping(); err != nil {
health.Checks["database"] = CheckResult{
Status: "unhealthy",
Message: err.Error(),
}
health.Status = "unhealthy"
} else {
health.Checks["database"] = CheckResult{
Status: "healthy",
Latency: time.Since(dbStart).String(),
}
}
// Check Redis
redisStart := time.Now()
if err := redis.Ping().Err(); err != nil {
health.Checks["redis"] = CheckResult{
Status: "unhealthy",
Message: err.Error(),
}
health.Status = "unhealthy"
} else {
health.Checks["redis"] = CheckResult{
Status: "healthy",
Latency: time.Since(redisStart).String(),
}
}
// Check downstream services
serviceStart := time.Now()
if err := checkService("http://product-service/health"); err != nil {
health.Checks["product-service"] = CheckResult{
Status: "unhealthy",
Message: err.Error(),
}
health.Status = "degraded" // Still functional but degraded
} else {
health.Checks["product-service"] = CheckResult{
Status: "healthy",
Latency: time.Since(serviceStart).String(),
}
}
status := http.StatusOK
if health.Status == "unhealthy" {
status = http.StatusServiceUnavailable
}
w.WriteHeader(status)
json.NewEncoder(w).Encode(health)
}
Detailed health checks help diagnose cascading failures.
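For reference, a response from this endpoint might look like the following (all values are purely illustrative):

{
  "status": "degraded",
  "version": "1.4.2",
  "uptime": "72h14m9s",
  "checks": {
    "database": { "status": "healthy", "latency": "2ms" },
    "redis": { "status": "healthy", "latency": "1ms" },
    "product-service": { "status": "unhealthy", "message": "connection timeout" }
  }
}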
Distributed System Challenges
Challenge 1: Cascading Failures
One slow service causes everything to slow down:
[Service A] → [Service B] → [Service C]
↓ (slow)
Database
Service C is slow. Service B waits. Service A waits. Everything is slow.
Solution: Circuit Breakers
type CircuitBreaker struct {
	mu           sync.Mutex // guards the fields below across concurrent requests
	maxFailures  int
	resetTimeout time.Duration
	state        string // "closed", "open", "half-open"
	failures     int
	lastFailTime time.Time
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.state == "open" {
		if time.Since(cb.lastFailTime) > cb.resetTimeout {
			cb.state = "half-open"
		} else {
			cb.mu.Unlock()
			return errors.New("circuit breaker open")
		}
	}
	cb.mu.Unlock()

	// Run the protected call without holding the lock
	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailTime = time.Now()
		if cb.failures >= cb.maxFailures {
			cb.state = "open"
			log.WithField("failures", cb.failures).Warn("Circuit breaker opened")
		}
		return err
	}
	if cb.state == "half-open" {
		cb.state = "closed"
		cb.failures = 0
		log.Info("Circuit breaker closed")
	}
	return nil
}
Track circuit breaker state as metrics:
metrics.Gauge("circuit_breaker.state", stateValue, map[string]string{
"service": serviceName,
})
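The stateValue above has to be numeric, so the state string needs an explicit encoding — a sketch (the mapping is arbitrary, as long as dashboards use the same one):

// stateToValue encodes the breaker state for the gauge:
// 0 = closed (healthy), 1 = half-open, 2 = open (failing fast)
func stateToValue(state string) float64 {
	switch state {
	case "closed":
		return 0
	case "half-open":
		return 1
	case "open":
		return 2
	default:
		return -1 // unknown state
	}
}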
Challenge 2: Sampling in High-Volume Systems
Tracing every request is expensive at scale.
Solution: Smart Sampling
type Sampler struct {
baseRate float64
errorRate float64
slowRate float64
}
func (s *Sampler) ShouldSample(r *http.Request, duration time.Duration, err error) bool {
// Always sample errors
if err != nil {
return rand.Float64() < s.errorRate
}
// Always sample slow requests
if duration > 1*time.Second {
return rand.Float64() < s.slowRate
}
// Sample normal requests at base rate
return rand.Float64() < s.baseRate
}
// Usage
sampler := &Sampler{
baseRate: 0.01, // 1% of normal requests
errorRate: 1.0, // 100% of errors
slowRate: 1.0, // 100% of slow requests
}
This captures interesting requests while reducing data volume.
Challenge 3: Clock Skew
Different servers have different times. Trace timings can be wrong.
Solution: Relative Timestamps
type Span struct {
TraceID string
SpanID string
ParentID string
StartTime time.Time
Duration time.Duration // Relative to start, not wall clock
}
Use duration from span start, not absolute timestamps. This eliminates clock skew issues.
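In practice that means finishing a span with the local monotonic clock rather than comparing wall-clock timestamps from different hosts — a sketch extending the struct above:

// finish records how long the span took using this host's clock only.
// time.Since reads Go's monotonic clock, so the duration is unaffected
// by wall-clock adjustments or skew between machines.
func (s *Span) finish() {
	s.Duration = time.Since(s.StartTime)
}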
Challenge 4: Cardinality Explosion
Too many metric dimensions creates millions of time series:
// Bad: High cardinality
metrics.Counter("api.requests", 1, map[string]string{
"endpoint": r.URL.Path, // 100 values
"user_id": userID, // 1,000,000 values ⚠️
"session": sessionID, // Infinite values ⚠️
})
Solution: Limit dimensions
// Good: Bounded cardinality
metrics.Counter("api.requests", 1, map[string]string{
"endpoint": r.URL.Path,
"user_tier": getUserTier(userID), // "free", "pro", "enterprise"
"status": strconv.Itoa(status),
})
Use logs/traces for high-cardinality data (user IDs, session IDs).
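A sketch of where the high-cardinality identifiers go instead — onto the trace span and the structured log entry for that single request (userID and sessionID are assumed to be in scope):

// Attach user/session identifiers to the span and the log entry,
// not to metric labels
if span := opentracing.SpanFromContext(ctx); span != nil {
	span.SetTag("user_id", userID)
	span.SetTag("session_id", sessionID)
}

log.WithFields(logrus.Fields{
	"user_id":    userID,
	"session_id": sessionID,
}).Info("Request handled")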
Building Dashboards
Metrics are useless without visualization.
Service Dashboard Template
Every service gets a standard dashboard:
# Grafana dashboard
dashboard:
  title: "Order Service"
  panels:
    - title: "Request Rate"
      metric: "sum(rate(http_requests_total[5m])) by (endpoint)"
    - title: "Error Rate"
      metric: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
    - title: "Latency (p50, p95, p99)"
      metrics:
        - "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        - "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        - "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    - title: "Dependency Latency"
      metric: "histogram_quantile(0.95, sum(rate(dependency_call_duration_seconds_bucket[5m])) by (le, service))"
    - title: "Circuit Breaker State"
      metric: "max(circuit_breaker_state) by (service)"
Every engineer can diagnose their service with the same dashboard layout.
Alerting Best Practices
Alerts should be actionable and non-noisy.
Symptom-Based Alerting
Alert on user-facing symptoms, not internal metrics:
# Bad: Internal metric
- alert: HighCPU
expr: cpu_usage > 80
message: "CPU usage is high"
# So what? Does this affect users?
# Good: User-facing symptom
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
message: "Error rate is {{ $value | humanizePercentage }}, check https://trace.example.com"
severity: critical
SLO-Based Alerting
Alert when SLOs are at risk:
- alert: LatencySLOBreach
expr: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
for: 5m
message: "p95 latency {{ $value }}s exceeds SLO of 500ms"
severity: warning
- alert: AvailabilitySLOBreach
expr: |
1 - (sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))) < 0.995
for: 10m
message: "Availability {{ $value | humanizePercentage }} below SLO of 99.5%"
severity: critical
Tools We Use
Metrics:
- Prometheus (collection and storage)
- Grafana (visualization)
- StatsD (client-side)
Logs:
- Elasticsearch (storage and search)
- Logstash/Fluentd (aggregation)
- Kibana (visualization)
Traces:
- Jaeger (distributed tracing)
- Zipkin (alternative to Jaeger)
All-in-One (Alternative):
- Datadog
- New Relic
- Honeycomb
We use open-source tools, but commercial options provide tighter integration.
Practical Rollout Strategy
Phase 1: Instrumentation (Week 1-2)
- Add structured logging to all services
- Instrument HTTP handlers with metrics
- Add health check endpoints
Phase 2: Collection (Week 3-4)
- Deploy Prometheus for metrics
- Deploy ELK stack for logs
- Set up Grafana dashboards
Phase 3: Correlation (Week 5-6)
- Add request ID propagation
- Implement distributed tracing
- Connect metrics, logs, and traces
Phase 4: Alerting (Week 7-8)
- Define SLOs
- Create symptom-based alerts
- Set up on-call rotation
Phase 5: Culture (Ongoing)
- Train engineers on observability tools
- Include observability in code reviews
- Use observability data in post-mortems
Conclusion
Observability is essential for distributed systems. Without it, you’re flying blind.
The key principles:
- Instrument everything: Metrics, logs, and traces
- Correlate data: Request IDs tie everything together
- Focus on symptoms: Alert on user-facing issues
- Make it self-service: Engineers debug their own services
- Continuous improvement: Use observability data to improve the system
Start small:
- Add structured logging
- Instrument HTTP handlers
- Set up basic dashboards
- Add distributed tracing
- Iterate and improve
Observability isn’t a project that finishes—it’s an ongoing practice.
In my next post, I’ll cover GDPR compliance preparation for engineers. Privacy regulations are coming, and engineering teams need to prepare now.
Good observability turns unknown unknowns into known knowns. Invest in it early.