When a single user request flows through fifteen microservices and fails, how do you figure out which service caused the problem? Traditional logging falls apart in distributed systems. You end up grepping through logs across dozens of services, trying to correlate timestamps, and hoping you find the root cause.

I’ve spent the last few months implementing distributed tracing for large-scale microservices, and it’s transformed how we debug production issues. Today, I want to share what I’ve learned about OpenTracing and how to make distributed tracing actually work in practice.

The Problem with Logs

Let’s say a user reports that their checkout failed. You check the logs:

order-service: [2017-04-20 10:23:14] Processing order for user 12345
payment-service: [2017-04-20 10:23:15] Payment failed: timeout
inventory-service: [2017-04-20 10:23:14] Reserved 3 items
shipping-service: [2017-04-20 10:23:16] No order found for user 12345

Which service actually failed? Did the payment timeout cause the problem, or was it a symptom? Why didn’t shipping find the order? The logs don’t tell you the causal chain.

This is where distributed tracing excels. It shows you the complete path of a request through your system, with timing information, error states, and contextual metadata.

What Is Distributed Tracing?

Distributed tracing tracks a single request as it flows through multiple services. Each service records spans—timed operations—that capture what work was performed and how long it took.

A trace is a tree of spans representing the full request path:

Trace: Checkout Request (total: 347ms)
├─ order-service: CreateOrder (203ms)
│  ├─ inventory-service: ReserveItems (45ms)
│  ├─ payment-service: ProcessPayment (150ms) [ERROR: timeout]
│  │  └─ payment-gateway: Charge (timeout after 120ms)
│  └─ notification-service: SendConfirmation (8ms)
└─ shipping-service: ScheduleDelivery (skipped due to error)

Now the problem is obvious: the payment gateway timed out, causing the entire checkout to fail. You can see exactly where the 347ms was spent and which operations succeeded or failed.

OpenTracing: A Standard API

OpenTracing is a vendor-neutral API for distributed tracing. It defines standard interfaces for creating spans, propagating context, and recording metadata. This means you can:

  • Instrument your code once using OpenTracing
  • Switch between tracing backends (Jaeger, Zipkin, etc.) without changing application code
  • Use library instrumentation that works with any tracer

Here’s the core concept: you create spans to represent operations, attach metadata, and propagate context between services.

Basic Instrumentation

Let’s instrument a simple service:

package main

import (
    "context"
    "net/http"
    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
)

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    // Extract tracing context from the incoming request headers.
    // If there is none, Extract returns an error and the span below
    // starts a new trace instead of joining an existing one.
    wireContext, _ := opentracing.GlobalTracer().Extract(
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(r.Header),
    )

    // Create a span for this operation
    span := opentracing.StartSpan(
        "checkout.process",
        ext.RPCServerOption(wireContext),
    )
    defer span.Finish()

    // Add metadata to the span
    span.SetTag("user.id", getUserID(r))
    span.SetTag("cart.items", getCartItemCount(r))

    // Attach the span to the request context so downstream calls inherit
    // both the span and the request's cancellation/deadline
    ctx := opentracing.ContextWithSpan(r.Context(), span)

    // Process the checkout
    if err := processOrder(ctx, r); err != nil {
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusOK)
}

func processOrder(ctx context.Context, r *http.Request) error {
    // Create a child span for this operation
    span, ctx := opentracing.StartSpanFromContext(ctx, "order.create")
    defer span.Finish()

    // Call other services (getCartItems pulls the items from the request; helper not shown)
    if err := reserveInventory(ctx, getCartItems(r)); err != nil {
        return err
    }

    if err := processPayment(ctx, r); err != nil {
        return err
    }

    return nil
}

The key steps are:

  1. Extract context from the incoming request
  2. Create a span for your operation
  3. Add relevant tags and logs
  4. Propagate context to downstream calls
  5. Mark errors explicitly

Propagating Context

The magic of distributed tracing is context propagation. When one service calls another, it must pass the trace context so spans can be connected.

Here’s how to propagate context in HTTP calls:

func reserveInventory(ctx context.Context, items []Item) error {
    span, ctx := opentracing.StartSpanFromContext(ctx, "inventory.reserve")
    defer span.Finish()

    // Serialize the request body (requires "bytes" and "encoding/json")
    payload, err := json.Marshal(items)
    if err != nil {
        return err
    }

    req, err := http.NewRequest("POST", "http://inventory-service/reserve", bytes.NewReader(payload))
    if err != nil {
        return err
    }

    // Inject trace context into HTTP headers
    opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(req.Header),
    )

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        ext.Error.Set(span, true)
        span.LogKV("http.status_code", resp.StatusCode)
        return fmt.Errorf("reservation failed: %d", resp.StatusCode)
    }

    return nil
}

The Inject call serializes the trace context into HTTP headers (uber-trace-id in Jaeger's case; other tracers use their own header names). The receiving service extracts it and links its spans to the same trace.

Tags, Logs, and Baggage

OpenTracing provides three ways to add metadata:

Tags

Tags are key-value pairs attached to a span. Use them for metadata that applies to the entire operation:

span.SetTag("http.method", "POST")
span.SetTag("http.url", "/api/checkout")
span.SetTag("user.tier", "premium")
span.SetTag("db.statement", "SELECT * FROM orders WHERE id = ?")

Tags are indexed in most tracing backends, making them searchable. You can find all traces where user.tier=premium or http.status_code=500.
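
Rather than hand-typing the standard key names, the opentracing-go ext package (already imported in the handler above) provides typed helpers for them, which avoids typos in the keys. A quick sketch, assuming span is the current span and the literal values are illustrative:

// Standard tags via the ext package; keys follow the OpenTracing semantic conventions
ext.SpanKindRPCClient.Set(span)
ext.HTTPMethod.Set(span, "POST")
ext.HTTPUrl.Set(span, "/api/checkout")
ext.HTTPStatusCode.Set(span, uint16(http.StatusOK))
ext.DBStatement.Set(span, "SELECT * FROM orders WHERE id = ?")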

Logs

Logs are timestamped events within a span. Use them for things that happen during the operation:

span.LogKV(
    "event", "cache_miss",
    "cache.key", "user:12345:preferences",
)

span.LogKV(
    "event", "retry_attempt",
    "attempt", 2,
    "error", err.Error(),
)

Logs show up on the timeline, helping you understand the sequence of events within a span.

Baggage

Baggage is key-value data that propagates across service boundaries. Use it sparingly—it’s transmitted with every request:

span.SetBaggageItem("user.id", "12345")
span.SetBaggageItem("request.id", "abc-def-ghi")

Downstream services can read baggage:

userID := span.BaggageItem("user.id")

I use baggage primarily for user IDs and request IDs that need to be accessible everywhere without explicit parameter passing.
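
Because baggage rides along with the trace context, a downstream service can read it off its own span without any extra parameters. Here's a sketch of a receiving handler that surfaces the user ID as a searchable tag (handleReserve is hypothetical, and it assumes tracing middleware or a manual Extract has already attached a span to the request context):

func handleReserve(w http.ResponseWriter, r *http.Request) {
    if span := opentracing.SpanFromContext(r.Context()); span != nil {
        // user.id was set as baggage by the edge service; re-tag it here so
        // this service's spans are searchable by user as well
        if userID := span.BaggageItem("user.id"); userID != "" {
            span.SetTag("user.id", userID)
        }
    }

    // ... reserve the items ...
}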

Sampling Strategies

Tracing every request in production generates massive amounts of data. Sampling reduces overhead while preserving enough traces to debug issues.

Probabilistic Sampling

Trace a fixed percentage of requests:

// Trace 1% of requests (the constructor also returns an error, ignored here for brevity)
sampler, _ := jaeger.NewProbabilisticSampler(0.01)

This is simple but can miss rare errors.

Rate Limiting Sampling

Trace up to N requests per second:

// Trace up to 100 requests per second
sampler := jaeger.NewRateLimitingSampler(100)

This ensures predictable trace volume.

Adaptive Sampling

The approach I prefer: always trace errors, sample successes at a lower rate.

func shouldSample(err error, latency time.Duration) bool {
    if err != nil {
        return true // Always trace errors
    }
    if latency > time.Second {
        return true // Always trace slow requests
    }
    return rand.Float64() < 0.01 // Sample 1% of normal requests
}

This gives you comprehensive error traces while keeping volume manageable. Note that the decision depends on how the request turned out, so spans must be buffered until it completes rather than sampled up front; this is essentially a tail-based decision.
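
Whichever strategy you choose, the sampler is wired in when the tracer is constructed and registered as the global tracer. Here's a minimal sketch using the Jaeger client's config package; the service name, agent address, and sampling rate are illustrative, and it assumes github.com/uber/jaeger-client-go/config:

import (
    "io"

    "github.com/opentracing/opentracing-go"
    jaegercfg "github.com/uber/jaeger-client-go/config"
)

func initTracer() (io.Closer, error) {
    cfg := jaegercfg.Configuration{
        Sampler: &jaegercfg.SamplerConfig{
            Type:  "probabilistic", // or "ratelimiting", "const", ...
            Param: 0.01,            // 1% of requests
        },
        Reporter: &jaegercfg.ReporterConfig{
            LocalAgentHostPort: "jaeger-agent:6831",
        },
    }

    // The service name is the only backend-specific detail the rest of the
    // code needs; everything else goes through the opentracing API.
    tracer, closer, err := cfg.New("checkout-service")
    if err != nil {
        return nil, err
    }
    opentracing.SetGlobalTracer(tracer)
    return closer, nil
}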

Integration with Frameworks

Most web frameworks have OpenTracing middleware:

import (
    "fmt"
    "net/http"

    "github.com/opentracing-contrib/go-stdlib/nethttp"
    "github.com/opentracing/opentracing-go"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/checkout", handleCheckout)

    // Wrap handlers with tracing middleware
    handler := nethttp.Middleware(
        opentracing.GlobalTracer(),
        mux,
        nethttp.OperationNameFunc(func(r *http.Request) string {
            return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
        }),
    )

    http.ListenAndServe(":8080", handler)
}

This middleware automatically creates a server-side span for each incoming request and picks up any trace context from its headers. It doesn't touch outgoing requests; for those, the same package offers a client-side wrapper so you don't have to call Inject by hand.
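
Here's a sketch of that client side, assuming ctx already carries the parent span (as in the earlier handlers); callInventory and the URL are illustrative:

func callInventory(ctx context.Context, url string) error {
    // Wrap the default transport so outgoing requests get client spans
    client := &http.Client{Transport: &nethttp.Transport{}}

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return err
    }
    req = req.WithContext(ctx)

    // TraceRequest ties this call to the span carried in ctx; the nethttp
    // Transport then creates the HTTP span and injects headers on the wire
    req, ht := nethttp.TraceRequest(opentracing.GlobalTracer(), req)
    defer ht.Finish()

    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("inventory call failed: %d", resp.StatusCode)
    }
    return nil
}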

Database Tracing

Database calls are often the slowest part of request processing. Tracing them is critical:

func getUser(ctx context.Context, userID string) (*User, error) {
    span, ctx := opentracing.StartSpanFromContext(ctx, "db.query.users")
    defer span.Finish()

    span.SetTag("db.type", "postgres")
    span.SetTag("db.instance", "users-db-primary")
    span.SetTag("db.statement", "SELECT * FROM users WHERE id = ?")

    var user User
    err := db.QueryRowContext(ctx, "SELECT * FROM users WHERE id = $1", userID).Scan(
        &user.ID, &user.Name, &user.Email,
    )

    if err != nil {
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
        return nil, err
    }

    return &user, nil
}

When you see a slow trace, you can immediately identify which database query is the bottleneck.
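
If you trace many queries, it's worth pulling the boilerplate into a small helper so the tags and error handling stay consistent. A sketch (tracedQuery is a hypothetical helper, not a library function):

// tracedQuery wraps any database call in a span with consistent tags
func tracedQuery(ctx context.Context, name, statement string, fn func(ctx context.Context) error) error {
    span, ctx := opentracing.StartSpanFromContext(ctx, name)
    defer span.Finish()

    ext.DBType.Set(span, "sql")
    ext.DBStatement.Set(span, statement)

    if err := fn(ctx); err != nil {
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
        return err
    }
    return nil
}

The getUser example above then shrinks to a single tracedQuery call wrapped around its QueryRowContext.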

Analyzing Traces

Once you’re collecting traces, use them to:

Debug Production Errors

Filter traces by error status and examine the failing span. The tags and logs usually reveal the root cause.

Identify Performance Bottlenecks

Look for traces with high latency and drill into the slowest spans. Often you’ll find:

  • N+1 queries hitting the database repeatedly
  • Synchronous calls that could be parallelized (see the sketch after this list)
  • Timeouts waiting for external services
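
For the second of these, Go's errgroup makes it straightforward to run independent calls in parallel while keeping their spans on the same trace, because the parent span travels in ctx. A sketch: getUser is the example from the database section, while loadCheckoutData and fetchRecommendations are hypothetical:

import (
    "context"

    "golang.org/x/sync/errgroup"
)

func loadCheckoutData(ctx context.Context, userID string) error {
    // Each goroutine gets the same ctx, so child spans attach to the same trace
    g, ctx := errgroup.WithContext(ctx)

    g.Go(func() error {
        _, err := getUser(ctx, userID) // traced as db.query.users above
        return err
    })
    g.Go(func() error {
        return fetchRecommendations(ctx, userID) // hypothetical traced call
    })

    return g.Wait()
}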

Understand System Behavior

Traces show you how requests actually flow through your system, which often differs from the architecture diagrams.

Best Practices

Name spans clearly: Use descriptive operation names like inventory.reserve rather than function1.

Tag generously: Add tags for user IDs, tenant IDs, feature flags, cache hits/misses. More metadata means better debugging.

Don’t log sensitive data: Avoid putting passwords, tokens, or PII in span logs.

Propagate context everywhere: If you make an outbound call without propagating context, that part of the trace disappears.

Monitor tracing overhead: Measure the CPU and memory cost of tracing. If it’s more than 1-2%, adjust sampling rates.

Common Pitfalls

Forgetting to finish spans: Always use defer span.Finish() to ensure spans are recorded even if the function panics.

Creating too many spans: Don’t create a span for every tiny function. Focus on meaningful operations like RPC calls, database queries, and business logic steps.

Not sampling: Tracing every request in production will overwhelm your tracing backend and add latency.

Looking Forward

Distributed tracing is becoming essential as systems grow more complex. The ecosystem is maturing rapidly:

  • Better automatic instrumentation for popular frameworks
  • Integration with metrics and logs for unified observability
  • Standards like OpenTracing converging into OpenTelemetry
  • Improved sampling strategies and tail-based sampling

For any team running microservices, I strongly recommend implementing distributed tracing. The ability to understand request flows and debug production issues makes it indispensable.

Start simple: instrument your HTTP handlers and database calls. Add sampling to keep overhead low. Then gradually expand tracing coverage as you see the value.

When the next production incident happens, you’ll have the visibility to resolve it quickly instead of drowning in disconnected logs.