Debugging distributed systems requires systematic approaches and the right tools. After debugging countless production issues in microservices architectures, I’ve developed methodologies that reduce mean time to resolution.
Distributed Tracing
Trace requests across services:
import "go.opentelemetry.io/otel/trace"
func processRequest(ctx context.Context, req *Request) error {
ctx, span := tracer.Start(ctx, "ProcessRequest")
defer span.End()
span.SetAttributes(
attribute.String("request.id", req.ID),
attribute.String("user.id", req.UserID),
)
// Call downstream services with context
result, err := downstreamService.Call(ctx, req)
if err != nil {
span.RecordError(err)
return err
}
return nil
}
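The example assumes a package-level tracer. A minimal sketch of how it might be initialized, assuming the OTLP gRPC exporter with a collector (or Jaeger with OTLP enabled) on the default localhost:4317 endpoint; the service name is a placeholder:

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Package-level tracer used by processRequest above
var tracer = otel.Tracer("example-service")

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // Export spans over OTLP/gRPC (localhost:4317 unless overridden)
    exp, err := otlptracegrpc.New(ctx)
    if err != nil {
        return nil, err
    }

    // Batch spans before export to keep per-request overhead low
    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
    otel.SetTracerProvider(tp)

    // Callers should defer tp.Shutdown(ctx) on service exit
    return tp, nil
}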
Structured Logging
Correlate logs with traces:
logger.WithFields(log.Fields{
"trace_id": span.SpanContext().TraceID().String(),
"span_id": span.SpanContext().SpanID().String(),
"user_id": userID,
}).Info("Processing request")
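With OpenTelemetry, the active span can be pulled straight from the context; a minimal sketch of a helper for that, assuming logrus as the logger (as in the snippet above):

import (
    "context"

    log "github.com/sirupsen/logrus"
    "go.opentelemetry.io/otel/trace"
)

// logWithTrace returns a log entry pre-populated with the current trace context
func logWithTrace(ctx context.Context, logger *log.Logger) *log.Entry {
    sc := trace.SpanFromContext(ctx).SpanContext()
    return logger.WithFields(log.Fields{
        "trace_id": sc.TraceID().String(),
        "span_id":  sc.SpanID().String(),
    })
}

Calling logWithTrace(ctx, logger).Info("Processing request") emits the same IDs the trace backend shows, so a trace ID copied from Jaeger can be pasted straight into the log search.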
Request Flow Visualization
Use Jaeger to visualize request flow and identify latency bottlenecks. Look for:
- Long spans indicating slow operations
- High span counts suggesting N+1 queries
- Error annotations showing failure points
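The error annotations above come from explicitly recording errors on the span; a brief sketch, assuming an active OpenTelemetry span in the context (the cache-miss event and attribute names are illustrative):

import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

func annotateFailure(ctx context.Context, key string, err error) {
    span := trace.SpanFromContext(ctx)

    // Events show up as point-in-time annotations on the span timeline
    span.AddEvent("cache miss", trace.WithAttributes(attribute.String("cache.key", key)))

    // RecordError plus an error status is what Jaeger renders as a failed span
    span.RecordError(err)
    span.SetStatus(codes.Error, "downstream call failed")
}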
Quick Debugging Checklist
- Check distributed traces for request flow
- Correlate logs using trace IDs
- Examine metrics for anomalies
- Verify network policies and service mesh config
- Test with controlled traffic
- Use chaos engineering to reproduce the failure
Correlation IDs
Implement correlation IDs for end-to-end tracking:
// Middleware to ensure every request carries a correlation ID
func CorrelationIDMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Extract the incoming correlation ID or generate a new one
        correlationID := r.Header.Get("X-Correlation-ID")
        if correlationID == "" {
            correlationID = generateUUID()
        }

        // Echo it back in the response headers
        w.Header().Set("X-Correlation-ID", correlationID)

        // Store it in the request context for downstream use
        ctx := context.WithValue(r.Context(), "correlation_id", correlationID)

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// Use in downstream calls
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    // Comma-ok assertion: don't panic if the ID is missing from the context
    if correlationID, ok := ctx.Value("correlation_id").(string); ok {
        req.Header.Set("X-Correlation-ID", correlationID)
    }
    return http.DefaultClient.Do(req)
}
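The string context key above is easy to read but collides easily and trips go vet; a sketch of the more idiomatic unexported key type (names here are illustrative):

import "context"

// ctxKey is unexported so no other package can collide with our context keys
type ctxKey int

const correlationIDKey ctxKey = iota

func withCorrelationID(ctx context.Context, id string) context.Context {
    return context.WithValue(ctx, correlationIDKey, id)
}

func correlationIDFrom(ctx context.Context) string {
    id, _ := ctx.Value(correlationIDKey).(string)
    return id
}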
Metrics-Driven Debugging
Use metrics to identify anomalies:
# Identify slow endpoints (p99 latency per endpoint)
topk(10,
  histogram_quantile(0.99,
    sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
  )
)
# Find error hotspots
topk(10,
sum(rate(http_requests_total{code=~"5.."}[5m])) by (endpoint)
)
# Detect traffic spikes
deriv(
sum(rate(http_requests_total[5m]))[10m:]
) > 100
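These queries assume request duration and count metrics labeled by endpoint; a minimal sketch of that instrumentation with prometheus/client_golang, where the metric and label names match the queries above but the wiring itself is an assumption:

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "Request latency by endpoint.",
            Buckets: prometheus.DefBuckets,
        },
        []string{"endpoint"},
    )
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Requests by endpoint and status code.",
        },
        []string{"endpoint", "code"},
    )
)

func init() {
    prometheus.MustRegister(requestDuration, requestsTotal)
    http.Handle("/metrics", promhttp.Handler())
}

// instrument records duration and status for every request to an endpoint
func instrument(endpoint string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)
        requestDuration.WithLabelValues(endpoint).Observe(time.Since(start).Seconds())
        requestsTotal.WithLabelValues(endpoint, strconv.Itoa(rec.status)).Inc()
    }
}

// statusRecorder captures the status code written by the wrapped handler
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}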
Chaos Engineering for Debugging
Reproduce issues in controlled environments:
// Chaos middleware - inject failures
type ChaosConfig struct {
ErrorRate float64 // Percentage of requests to fail
LatencyMs int // Additional latency to inject
Enabled bool
}
func ChaosMiddleware(config *ChaosConfig) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if !config.Enabled {
next.ServeHTTP(w, r)
return
}
// Inject latency
if config.LatencyMs > 0 {
time.Sleep(time.Duration(config.LatencyMs) * time.Millisecond)
}
// Inject errors
if rand.Float64()*100 < config.ErrorRate {
http.Error(w, "Chaos: Injected Error", http.StatusInternalServerError)
return
}
next.ServeHTTP(w, r)
})
}
}
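A hypothetical way to wire the middleware so failure injection never runs in production (the environment variable and handler are placeholders):

import (
    "log"
    "net/http"
    "os"
)

func main() {
    chaos := &ChaosConfig{
        ErrorRate: 5,   // fail roughly 5% of requests
        LatencyMs: 200, // add 200ms to every request
        Enabled:   os.Getenv("APP_ENV") == "staging",
    }

    mux := http.NewServeMux()
    mux.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })

    log.Fatal(http.ListenAndServe(":8080", ChaosMiddleware(chaos)(mux)))
}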
Enable chaos in staging to validate observability:
# Chaos experiment definition
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    jitter: "50ms"
  duration: "5m"
Performance Profiling
Profile production services safely:
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
    "runtime"
)

func enableProfiling() {
    // Record every allocation in heap profiles (maximum detail, higher overhead);
    // keep the default rate in latency-sensitive services
    runtime.SetMemProfileRate(1)

    // Serve pprof on a separate, localhost-only port so it is never exposed publicly
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}
// Access profiles:
// curl http://localhost:6060/debug/pprof/heap > heap.prof
// curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
// go tool pprof heap.prof
Analyze profiles:
# CPU profile
go tool pprof -http=:8080 cpu.prof
# Memory profile
go tool pprof -http=:8080 heap.prof
# Look for:
# - Hot code paths consuming CPU
# - Memory leaks (growing allocations)
# - Goroutine leaks
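Goroutine leaks are also easy to watch for in-process; a minimal sketch, assuming a background watcher with a hand-picked threshold and dump path:

import (
    "log"
    "os"
    "runtime"
    "runtime/pprof"
    "time"
)

// watchGoroutines logs the goroutine count and dumps a profile when it grows too large
func watchGoroutines(threshold int) {
    go func() {
        for range time.Tick(time.Minute) {
            n := runtime.NumGoroutine()
            log.Printf("goroutines: %d", n)
            if n > threshold {
                // Dump the goroutine profile for offline analysis with go tool pprof
                f, err := os.Create("/tmp/goroutine.prof")
                if err != nil {
                    continue
                }
                pprof.Lookup("goroutine").WriteTo(f, 1)
                f.Close()
            }
        }
    }()
}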
Database Query Analysis
Debug slow database operations:
// Query logging middleware
type QueryLogger struct {
db *sql.DB
logger *log.Logger
}
func (ql *QueryLogger) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
start := time.Now()
rows, err := ql.db.QueryContext(ctx, query, args...)
duration := time.Since(start)
// Log slow queries
if duration > 100*time.Millisecond {
ql.logger.Printf("SLOW QUERY (%v): %s [%v]", duration, query, args)
// Include stack trace for analysis
buf := make([]byte, 4096)
n := runtime.Stack(buf, false)
ql.logger.Printf("Stack: %s", buf[:n])
}
return rows, err
}
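Hypothetical wiring for the logger above; the driver, DSN, and query are placeholders:

db, _ := sql.Open("postgres", os.Getenv("DATABASE_URL")) // driver and DSN are placeholders
queries := &QueryLogger{db: db, logger: log.Default()}

// Slow calls now show up in the logs with their duration and a stack trace
rows, err := queries.QueryContext(ctx, "SELECT id, total FROM orders WHERE user_id = $1", userID)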
Use EXPLAIN for query optimization:
-- Analyze query plan
EXPLAIN ANALYZE
SELECT o.id, o.total, u.name
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.created_at > NOW() - INTERVAL '7 days'
ORDER BY o.created_at DESC
LIMIT 100;
-- Look for:
-- - Sequential scans (need indexes)
-- - High cost operations
-- - Missing indexes on JOIN/WHERE columns
Network Debugging
Analyze service-to-service communication:
# Capture traffic between pods
kubectl sniff -n production pod/api-7d9f8c-xyz -f "port 8080"
# Analyze with tcpdump
tcpdump -i any -nn port 8080 -w capture.pcap
# View HTTP requests
tcpdump -i any -nn -A port 8080 | grep -E "GET|POST|PUT|DELETE"
Service mesh provides built-in traffic visibility:
# Istio traffic inspection
istioctl dashboard envoy pod/api-7d9f8c-xyz
# View configuration
istioctl proxy-config routes pod/api-7d9f8c-xyz
# Check mTLS status
istioctl authn tls-check pod/api-7d9f8c-xyz
Memory Leak Detection
Identify and resolve memory leaks:
// Track allocations
type MemoryTracker struct {
samples []runtime.MemStats
mu sync.Mutex
}
func (mt *MemoryTracker) Sample() {
    mt.mu.Lock()
    defer mt.mu.Unlock()
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    mt.samples = append(mt.samples, m)
    // Keep a bounded window so the tracker itself doesn't leak memory
    if len(mt.samples) > 100 {
        mt.samples = mt.samples[len(mt.samples)-100:]
    }
    // Alert if memory is consistently growing
    if len(mt.samples) > 10 {
        recent := mt.samples[len(mt.samples)-10:]
        if mt.isGrowing(recent) {
            log.Println("WARNING: possible memory leak detected")
            mt.dumpHeapProfile()
        }
    }
}
func (mt *MemoryTracker) isGrowing(samples []runtime.MemStats) bool {
// Check if Alloc is consistently increasing
for i := 1; i < len(samples); i++ {
if samples[i].Alloc <= samples[i-1].Alloc {
return false
}
}
return true
}
func (mt *MemoryTracker) dumpHeapProfile() {
f, err := os.Create("/tmp/heap.prof")
if err != nil {
return
}
defer f.Close()
pprof.WriteHeapProfile(f)
}
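A short sketch of running the tracker on a schedule; the interval is arbitrary:

func startMemoryTracking(interval time.Duration) *MemoryTracker {
    mt := &MemoryTracker{}
    go func() {
        // Sample indefinitely; in a real service, tie this loop to shutdown
        for range time.Tick(interval) {
            mt.Sample()
        }
    }()
    return mt
}

Call startMemoryTracking(30 * time.Second) during service startup and keep the returned tracker if you want to inspect the samples later.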
Distributed System Patterns for Debugging
Circuit Breaker Insights
Monitor circuit breaker state:
type CircuitBreaker struct {
    name            string
    state           State // Open, HalfOpen, Closed
    failures        int
    threshold       int           // Consecutive failures before the breaker opens
    timeout         time.Duration // How long to stay open before probing with HalfOpen
    lastStateChange time.Time
    metrics         MetricsCollector
    // Note: a production implementation must guard this state with a mutex
}
func (cb *CircuitBreaker) Call(ctx context.Context, fn func() error) error {
cb.metrics.RecordAttempt(cb.name, cb.state)
if cb.state == Open {
if time.Since(cb.lastStateChange) > cb.timeout {
cb.state = HalfOpen
cb.metrics.RecordStateChange(cb.name, "open", "half_open")
} else {
cb.metrics.RecordRejection(cb.name)
return ErrCircuitOpen
}
}
err := fn()
if err != nil {
cb.failures++
cb.metrics.RecordFailure(cb.name, err)
if cb.failures >= cb.threshold {
cb.state = Open
cb.lastStateChange = time.Now()
cb.metrics.RecordStateChange(cb.name, "closed", "open")
}
} else {
cb.failures = 0
if cb.state == HalfOpen {
cb.state = Closed
cb.metrics.RecordStateChange(cb.name, "half_open", "closed")
}
}
return err
}
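Hypothetical usage around a downstream call; paymentClient, Order, and the fallback error are placeholders:

// chargeWithBreaker wraps a payment call in the circuit breaker above
func chargeWithBreaker(ctx context.Context, cb *CircuitBreaker, order Order) error {
    err := cb.Call(ctx, func() error {
        return paymentClient.Charge(ctx, order)
    })
    if errors.Is(err, ErrCircuitOpen) {
        // Breaker is open: fail fast instead of piling onto a struggling dependency
        return ErrPaymentUnavailable
    }
    return err
}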
Bulkhead Pattern Monitoring
Track resource pool utilization:
type Bulkhead struct {
name string
semaphore chan struct{}
metrics MetricsCollector
}
func NewBulkhead(name string, maxConcurrent int) *Bulkhead {
return &Bulkhead{
name: name,
semaphore: make(chan struct{}, maxConcurrent),
metrics: NewMetricsCollector(),
}
}
func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
    select {
    case b.semaphore <- struct{}{}:
        defer func() { <-b.semaphore }()
        b.metrics.RecordExecution(b.name, len(b.semaphore))
        return fn()
    case <-ctx.Done():
        // Context already cancelled or past its deadline
        b.metrics.RecordRejection(b.name, "timeout")
        return ctx.Err()
    default:
        // Pool exhausted: shed load immediately rather than queue
        b.metrics.RecordRejection(b.name, "capacity")
        return ErrBulkheadFull
    }
}
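Hypothetical usage capping concurrency to a slow dependency; the client, limit, and retry error are placeholders:

var inventoryBulkhead = NewBulkhead("inventory", 20)

// reserveWithBulkhead caps concurrent calls to the inventory service
func reserveWithBulkhead(ctx context.Context, items []Item) error {
    err := inventoryBulkhead.Execute(ctx, func() error {
        return inventoryClient.Reserve(ctx, items)
    })
    if errors.Is(err, ErrBulkheadFull) {
        // Shed load rather than let one slow dependency exhaust every worker
        return ErrTryAgainLater
    }
    return err
}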
Log Aggregation Best Practices
Structure logs for effective searching:
type StructuredLogger struct {
logger *zap.Logger
}
func (sl *StructuredLogger) LogRequest(ctx context.Context, req *Request, resp *Response, err error) {
    // Comma-ok assertions avoid a panic when a context value is missing
    correlationID, _ := ctx.Value("correlation_id").(string)
    userID, _ := ctx.Value("user_id").(string)
fields := []zap.Field{
zap.String("correlation_id", correlationID),
zap.String("user_id", userID),
zap.String("method", req.Method),
zap.String("path", req.Path),
zap.Int("status", resp.StatusCode),
zap.Duration("duration", resp.Duration),
}
if err != nil {
fields = append(fields,
zap.Error(err),
zap.String("error_type", fmt.Sprintf("%T", err)),
)
sl.logger.Error("Request failed", fields...)
} else {
sl.logger.Info("Request completed", fields...)
}
}
Query logs effectively:
# Find all errors for a correlation ID
correlation_id:"abc-123" AND level:error
# Find slow requests
duration:>1000 AND path:"/api/orders"
# Find specific error types
error_type:"DatabaseError" AND service:"payment"
Debugging Checklist
Systematic approach to debugging distributed systems:
- Identify the symptom: What is failing? For whom?
- Check recent changes: Deployments, config changes, infrastructure
- Review metrics: Golden signals (latency, traffic, errors, saturation)
- Examine traces: Find slow spans, errors in request flow
- Correlate logs: Use trace/correlation IDs to find related logs
- Inspect dependencies: Are downstream services healthy?
- Test hypothesis: Use chaos engineering or synthetic tests
- Implement fix: Deploy to staging first
- Verify resolution: Monitor metrics, traces, and logs
- Document learnings: Update runbooks and postmortems
Conclusion
Debugging distributed systems requires:
- Distributed tracing for request flow visibility
- Structured logging with correlation IDs
- Comprehensive metrics for anomaly detection
- Systematic methodology rather than random changes
- Chaos engineering to validate observability
- Performance profiling for resource optimization
- Network analysis for communication issues
- Circuit breaker and bulkhead patterns with monitoring
Build observability into systems from the start. When issues occur, use a systematic approach: gather data from traces, logs, and metrics; form hypotheses; test them; and document learnings for future incidents.
The goal is not just to fix the immediate issue, but to improve your system’s observability and resilience so the next issue is easier and faster to debug. Every incident is an opportunity to strengthen your debugging capabilities and system design.