After debugging too many production issues with insufficient data, I’ve become obsessed with observability. The ability to understand what your system is doing in production is the difference between quick resolution and days of guesswork. Here’s how I built comprehensive observability into FC-Redirect.
The Three Pillars of Observability
Observability rests on three foundations:
- Metrics: Numerical measurements over time
- Logs: Discrete events with context
- Traces: Request flows through distributed components
Let me show how I implemented each in FC-Redirect.
Metrics: Measuring What Matters
Identifying Key Metrics
Not everything needs to be measured. I focus on:
The Four Golden Signals (from Google SRE):
- Latency: How long do operations take?
- Traffic: How many requests are we serving?
- Errors: What’s failing?
- Saturation: How full are we?
FC-Redirect-Specific Metrics:
typedef struct fc_redirect_metrics {
    // Latency
    histogram_t packet_latency;
    histogram_t flow_lookup_latency;
    histogram_t redirect_decision_latency;

    // Traffic
    atomic_uint64_t packets_per_second;
    atomic_uint64_t bytes_per_second;
    atomic_uint64_t flows_per_second;

    // Errors
    atomic_uint64_t lookup_failures;
    atomic_uint64_t redirect_failures;
    atomic_uint64_t queue_overflows;

    // Saturation
    atomic_uint64_t flow_table_utilization;  // 0-100%
    atomic_uint64_t cpu_utilization;         // 0-100%
    atomic_uint64_t memory_utilization;      // 0-100%
} fc_redirect_metrics_t;
Efficient Metric Collection
Metrics must be cheap to collect:
// Per-thread counters (no atomic operations!): one cache-line-aligned
// slot per worker thread, so writers never share a cache line.
typedef struct thread_counters {
    uint64_t packets_processed;
    uint64_t bytes_processed;
    uint64_t lookups_performed;
} __attribute__((aligned(64))) thread_counters_t;

static thread_counters_t thread_counters[MAX_THREADS];
static __thread int my_thread_id;  // assigned when the worker thread starts

static inline void record_packet_processed(size_t bytes) {
    thread_counters[my_thread_id].packets_processed++;
    thread_counters[my_thread_id].bytes_processed += bytes;
}

// Periodic aggregation (the reads are racy, which is acceptable for metrics)
void aggregate_metrics(void) {
    uint64_t total_packets = 0;
    uint64_t total_bytes = 0;

    for (int i = 0; i < num_threads; i++) {
        total_packets += thread_counters[i].packets_processed;
        total_bytes += thread_counters[i].bytes_processed;

        // Reset for next period
        thread_counters[i].packets_processed = 0;
        thread_counters[i].bytes_processed = 0;
    }

    // Export aggregated metrics
    export_metric("packets_per_second", total_packets);
    export_metric("bytes_per_second", total_bytes);
}
Per-thread counters eliminate atomic overhead. We aggregate periodically (every second) for export.
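As a minimal sketch of how that periodic aggregation is driven: the dedicated pthread-based housekeeping thread and the sleep(1) interval below are assumptions for illustration, not necessarily how FC-Redirect schedules its stats work.

#include <pthread.h>
#include <unistd.h>

// Housekeeping thread: wakes once per second, folds the per-thread
// counters into the exported metrics, then goes back to sleep.
static void *metrics_thread_main(void *arg) {
    (void)arg;
    for (;;) {
        sleep(1);             // one-second aggregation period
        aggregate_metrics();
    }
    return NULL;
}

// Called once during startup (error handling omitted for brevity).
void start_metrics_thread(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, metrics_thread_main, NULL);
}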
Latency Histograms
Averages lie. Always use histograms:
typedef struct histogram {
    atomic_uint64_t buckets[NUM_BUCKETS];
    uint64_t bucket_boundaries[NUM_BUCKETS];
} histogram_t;

void histogram_init(histogram_t *h) {
    // Exponential bucket boundaries in nanoseconds: 1ns, 2ns, 4ns, 8ns, ...
    for (int i = 0; i < NUM_BUCKETS; i++) {
        h->bucket_boundaries[i] = (1ULL << i);  // Powers of 2
        atomic_store(&h->buckets[i], 0);
    }
}

void histogram_record(histogram_t *h, uint64_t value_ns) {
    // Linear scan for the bucket (NUM_BUCKETS is small)
    int bucket = 0;
    while (bucket < NUM_BUCKETS - 1 &&
           value_ns >= h->bucket_boundaries[bucket + 1]) {
        bucket++;
    }
    atomic_fetch_add(&h->buckets[bucket], 1);
}

double histogram_percentile(histogram_t *h, double percentile) {
    uint64_t total = 0;
    for (int i = 0; i < NUM_BUCKETS; i++) {
        total += atomic_load(&h->buckets[i]);
    }
    if (total == 0) {
        return 0.0;
    }

    uint64_t target = (uint64_t)(total * percentile / 100.0);
    uint64_t cumulative = 0;
    for (int i = 0; i < NUM_BUCKETS; i++) {
        cumulative += atomic_load(&h->buckets[i]);
        if (cumulative >= target) {
            return h->bucket_boundaries[i];
        }
    }
    return h->bucket_boundaries[NUM_BUCKETS - 1];
}
Now I can calculate P50, P99, P999 latencies accurately.
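For completeness, here is how the histogram is fed and queried in practice. The now_ns() helper, the global metrics instance, and the timed_flow_lookup wrapper are illustrative assumptions; only histogram_record, histogram_percentile, and lookup_flow come from the code above.

#include <time.h>

static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

// Time one flow lookup and record the latency in nanoseconds.
void timed_flow_lookup(uint64_t flow_key) {
    uint64_t start = now_ns();
    flow_entry_t *flow = lookup_flow(flow_key);
    histogram_record(&metrics.flow_lookup_latency, now_ns() - start);
    (void)flow;  // result handling omitted; this sketch only shows the timing
}

// Later, when exporting:
//   double p50  = histogram_percentile(&metrics.flow_lookup_latency, 50.0);
//   double p99  = histogram_percentile(&metrics.flow_lookup_latency, 99.0);
//   double p999 = histogram_percentile(&metrics.flow_lookup_latency, 99.9);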
Metric Export
I export metrics via a simple HTTP endpoint:
void handle_metrics_request(http_request_t *req) {
    char *response = malloc(64 * 1024);
    if (!response) {
        return;
    }
    size_t offset = 0;

    // Prometheus exposition format
    offset += sprintf(response + offset,
        "# HELP packets_per_second Packet processing rate\n");
    offset += sprintf(response + offset,
        "# TYPE packets_per_second gauge\n");
    offset += sprintf(response + offset,
        "packets_per_second %lu\n",
        get_packets_per_second());

    offset += sprintf(response + offset,
        "# HELP flow_lookup_latency_p99 P99 flow lookup latency (ns)\n");
    offset += sprintf(response + offset,
        "# TYPE flow_lookup_latency_p99 gauge\n");
    offset += sprintf(response + offset,
        "flow_lookup_latency_p99 %.0f\n",
        histogram_percentile(&latency_histogram, 99.0));

    // ... more metrics ...

    http_send_response(req, response, offset);
    free(response);
}
This integrates seamlessly with Prometheus for scraping and Grafana for visualization.
Logs: Context for Understanding
Structured Logging
I use structured logging for machine-parseable logs:
typedef enum {
    LOG_DEBUG,
    LOG_INFO,
    LOG_WARN,
    LOG_ERROR,
    LOG_CRITICAL
} log_level_t;

void log_structured(log_level_t level, const char *event,
                    const char *key1, const char *val1, ...) {
    // Build the whole line in a local buffer so concurrent threads cannot
    // interleave fields, then emit it with a single write.
    char line[1024];
    size_t off = 0;

    // JSON format for structured logging
    off += snprintf(line + off, sizeof(line) - off,
                    "{\"timestamp\":\"%lu\",\"level\":\"%s\",\"event\":\"%s\"",
                    time_microseconds(), log_level_str(level), event);

    va_list args;
    va_start(args, val1);
    const char *key = key1;
    const char *val = val1;
    while (key != NULL && off < sizeof(line)) {
        off += snprintf(line + off, sizeof(line) - off,
                        ",\"%s\":\"%s\"", key, val);
        key = va_arg(args, const char *);
        if (key) val = va_arg(args, const char *);
    }
    va_end(args);

    fprintf(stderr, "%s}\n", line);
}

// Usage
log_structured(LOG_INFO, "flow_created",
               "flow_key", flow_key_str,
               "source_wwpn", src_wwpn_str,
               "dest_wwpn", dst_wwpn_str,
               NULL);
Structured logs can be indexed and searched efficiently.
Log Levels and Sampling
Not everything should be logged at INFO:
#define LOG_DEBUG_SAMPLED(fmt, ...) \
    do { \
        if (should_sample_debug_log()) { \
            log_debug(fmt, ##__VA_ARGS__); \
        } \
    } while (0)

static bool should_sample_debug_log(void) {
    static atomic_uint64_t counter = 0;
    uint64_t val = atomic_fetch_add(&counter, 1);
    // Log 1 in 1000 debug events
    return (val % 1000) == 0;
}

// Usage in hot path
LOG_DEBUG_SAMPLED("Processing packet flow=%016lx size=%u",
                  flow_key, packet->size);
Sampling prevents log volume from overwhelming the system while still providing visibility.
Ring Buffer Logging
For post-mortem debugging, I keep a ring buffer of recent events:
#define RING_LOG_SIZE (1024 * 1024)

typedef struct ring_log_entry {
    uint64_t timestamp;
    uint32_t thread_id;
    char message[256];
} ring_log_entry_t;

typedef struct ring_log {
    ring_log_entry_t entries[RING_LOG_SIZE];
    atomic_uint64_t index;
} ring_log_t;

ring_log_t global_ring_log = {0};

void ring_log(const char *fmt, ...) {
    uint64_t idx = atomic_fetch_add(&global_ring_log.index, 1);
    ring_log_entry_t *entry = &global_ring_log.entries[idx % RING_LOG_SIZE];

    entry->timestamp = rdtsc();
    entry->thread_id = gettid();

    va_list args;
    va_start(args, fmt);
    vsnprintf(entry->message, sizeof(entry->message), fmt, args);
    va_end(args);
}

// Dump the ring log on crash. fprintf is not async-signal-safe, but we are
// about to abort anyway, so the tradeoff is acceptable here.
void signal_handler(int sig) {
    (void)sig;
    fprintf(stderr, "=== CRASH RING LOG DUMP ===\n");

    uint64_t current = atomic_load(&global_ring_log.index);
    uint64_t start = current > RING_LOG_SIZE ? current - RING_LOG_SIZE : 0;

    for (uint64_t i = start; i < current; i++) {
        ring_log_entry_t *entry =
            &global_ring_log.entries[i % RING_LOG_SIZE];
        fprintf(stderr, "[%lu] tid=%u: %s\n",
                entry->timestamp, entry->thread_id, entry->message);
    }
    abort();
}
When crashes occur, we dump the last million log entries from the ring buffer.
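The handler also has to be registered for the fatal signals. A minimal sketch of that wiring follows; the exact signal set and the SA_RESETHAND choice are assumptions about how FC-Redirect installs it.

#include <signal.h>
#include <string.h>

// Register the crash handler. SIGABRT is left at its default disposition
// because signal_handler() itself ends in abort().
void install_crash_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;  // one shot: a fault inside the handler cannot loop

    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);
    sigaction(SIGILL, &sa, NULL);
    sigaction(SIGFPE, &sa, NULL);
}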
Traces: Following Request Flows
Distributed Tracing
For distributed systems, I implement request tracing:
typedef struct trace_context {
    uint64_t trace_id;
    uint64_t span_id;
    uint64_t parent_span_id;
    uint64_t start_time;
} trace_context_t;

// Start a new trace
trace_context_t trace_start(const char *operation) {
    trace_context_t ctx = {
        .trace_id = generate_trace_id(),
        .span_id = generate_span_id(),
        .parent_span_id = 0,
        .start_time = time_microseconds()
    };
    log_structured(LOG_DEBUG, "trace_start",
                   "trace_id", trace_id_str(ctx.trace_id),
                   "span_id", span_id_str(ctx.span_id),
                   "operation", operation,
                   NULL);
    return ctx;
}

// Create child span
trace_context_t trace_child_span(trace_context_t *parent,
                                 const char *operation) {
    trace_context_t ctx = {
        .trace_id = parent->trace_id,
        .span_id = generate_span_id(),
        .parent_span_id = parent->span_id,
        .start_time = time_microseconds()
    };
    log_structured(LOG_DEBUG, "span_start",
                   "trace_id", trace_id_str(ctx.trace_id),
                   "span_id", span_id_str(ctx.span_id),
                   "parent_span_id", span_id_str(ctx.parent_span_id),
                   "operation", operation,
                   NULL);
    return ctx;
}

// End span
void trace_end(trace_context_t *ctx) {
    uint64_t duration = time_microseconds() - ctx->start_time;
    log_structured(LOG_DEBUG, "span_end",
                   "trace_id", trace_id_str(ctx->trace_id),
                   "span_id", span_id_str(ctx->span_id),
                   "duration_us", uint64_str(duration),
                   NULL);
}

// Usage
void process_request(packet_t *pkt) {
    trace_context_t trace = trace_start("process_packet");

    trace_context_t lookup_span = trace_child_span(&trace, "flow_lookup");
    flow_entry_t *flow = lookup_flow(pkt->flow_key);
    trace_end(&lookup_span);

    trace_context_t policy_span = trace_child_span(&trace, "apply_policy");
    apply_policy(pkt, flow);
    trace_end(&policy_span);

    trace_end(&trace);
}
This generates traces I can visualize in tools like Jaeger or Zipkin.
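In-process spans are only half of the story: to follow a request across distributed components, the trace context has to travel with it. Below is a sketch of propagating the IDs in a small wire header; the fc_trace_header_t layout and the trace_inject/trace_extract names are assumptions for illustration, reusing the trace API defined above.

// Hypothetical on-the-wire header carrying the trace context between
// components. Packed so both sides agree on the layout.
typedef struct __attribute__((packed)) fc_trace_header {
    uint64_t trace_id;
    uint64_t parent_span_id;
} fc_trace_header_t;

// Sender side: embed the current context before forwarding.
void trace_inject(const trace_context_t *ctx, fc_trace_header_t *hdr) {
    hdr->trace_id = ctx->trace_id;
    hdr->parent_span_id = ctx->span_id;  // our span becomes the child's parent
}

// Receiver side: continue the same trace with a new span.
trace_context_t trace_extract(const fc_trace_header_t *hdr,
                              const char *operation) {
    trace_context_t ctx = {
        .trace_id = hdr->trace_id,
        .span_id = generate_span_id(),
        .parent_span_id = hdr->parent_span_id,
        .start_time = time_microseconds()
    };
    log_structured(LOG_DEBUG, "span_start",
                   "trace_id", trace_id_str(ctx.trace_id),
                   "span_id", span_id_str(ctx.span_id),
                   "parent_span_id", span_id_str(ctx.parent_span_id),
                   "operation", operation,
                   NULL);
    return ctx;
}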
Alerting: Turning Data into Action
Metrics are useless without alerts:
typedef struct alert_rule {
    const char *name;
    metric_getter_fn get_metric;
    alert_condition_t condition;
    double threshold;
    alert_severity_t severity;
} alert_rule_t;

alert_rule_t alert_rules[] = {
    {
        .name = "High Packet Loss",
        .get_metric = get_packet_loss_rate,
        .condition = CONDITION_GREATER_THAN,
        .threshold = 0.01,   // 1%
        .severity = SEVERITY_CRITICAL
    },
    {
        .name = "Flow Table Near Full",
        .get_metric = get_flow_table_utilization,
        .condition = CONDITION_GREATER_THAN,
        .threshold = 0.85,   // 85%
        .severity = SEVERITY_WARNING
    },
    // ... more rules ...
};

static const int num_alert_rules = sizeof(alert_rules) / sizeof(alert_rules[0]);

void evaluate_alerts(void) {
    for (int i = 0; i < num_alert_rules; i++) {
        alert_rule_t *rule = &alert_rules[i];
        double value = rule->get_metric();

        bool triggered = false;
        switch (rule->condition) {
        case CONDITION_GREATER_THAN:
            triggered = value > rule->threshold;
            break;
        case CONDITION_LESS_THAN:
            triggered = value < rule->threshold;
            break;
        }

        if (triggered) {
            fire_alert(rule->name, value, rule->severity);
        }
    }
}
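The rules only decide when a threshold is crossed; fire_alert itself should deduplicate, otherwise a sustained condition pages someone on every evaluation cycle. Here is a sketch with a simple cooldown; the double_str and send_alert_notification helpers are assumed hooks, not part of the code above.

#include <string.h>
#include <time.h>

#define ALERT_COOLDOWN_SECONDS 300   // suppress repeats for 5 minutes
#define MAX_TRACKED_ALERTS     64

// Remember when each alert last notified so a sustained condition
// does not page on every evaluation cycle.
static struct {
    const char *name;
    time_t last_sent;
} alert_state[MAX_TRACKED_ALERTS];

void fire_alert(const char *name, double value, alert_severity_t severity) {
    time_t now = time(NULL);

    for (int i = 0; i < MAX_TRACKED_ALERTS; i++) {
        if (alert_state[i].name == NULL) {
            alert_state[i].name = name;       // first time we see this alert
            alert_state[i].last_sent = 0;
        }
        if (strcmp(alert_state[i].name, name) == 0) {
            if (now - alert_state[i].last_sent < ALERT_COOLDOWN_SECONDS) {
                return;                       // still in cooldown, stay quiet
            }
            alert_state[i].last_sent = now;
            break;
        }
    }

    log_structured(LOG_WARN, "alert_fired",
                   "alert", name,
                   "value", double_str(value),
                   NULL);
    send_alert_notification(name, value, severity);  // pager/webhook hook
}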
Dashboard: Visualizing System Health
I created a real-time dashboard using Grafana:
Panel 1: Traffic
- Packets per second (line graph)
- Bytes per second (line graph)
- Top 10 flows by bandwidth (bar chart)
Panel 2: Latency
- P50, P99, P999 latencies (line graphs)
- Latency histogram (heatmap)
Panel 3: Errors
- Error rate (line graph)
- Error types (pie chart)
- Recent errors (table)
Panel 4: Resources
- CPU utilization (gauge)
- Memory utilization (gauge)
- Flow table utilization (gauge)
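The latency heatmap works best when the exporter publishes the raw histogram rather than precomputed percentiles, because Grafana can render cumulative bucket series directly. Below is a sketch of emitting the histogram_t from earlier in Prometheus's cumulative _bucket{le="..."} text format; the export_histogram name is an assumption, buffer-overflow handling is deliberately omitted, and the _sum series is skipped because histogram_t does not track a sum.

// Append one histogram in Prometheus text format (cumulative buckets).
size_t export_histogram(char *buf, size_t cap, const char *name, histogram_t *h) {
    size_t off = 0;
    uint64_t cumulative = 0;

    off += snprintf(buf + off, cap - off, "# TYPE %s histogram\n", name);

    for (int i = 0; i < NUM_BUCKETS; i++) {
        cumulative += atomic_load(&h->buckets[i]);
        if (i < NUM_BUCKETS - 1) {
            // The upper bound of bucket i is the next bucket's lower boundary.
            off += snprintf(buf + off, cap - off,
                            "%s_bucket{le=\"%lu\"} %lu\n",
                            name, h->bucket_boundaries[i + 1], cumulative);
        } else {
            off += snprintf(buf + off, cap - off,
                            "%s_bucket{le=\"+Inf\"} %lu\n", name, cumulative);
        }
    }
    off += snprintf(buf + off, cap - off, "%s_count %lu\n", name, cumulative);
    return off;
}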
Real-World Examples
Example 1: Memory Leak Detection
Monitoring revealed gradual memory growth:
Memory utilization over 7 days:
Day 0: 45%
Day 1: 48%
Day 2: 51%
Day 3: 54%
...
Without monitoring, this would eventually have ended in an out-of-memory kill. With monitoring, I caught and fixed the leak well before that point.
Example 2: Performance Regression
After a deployment, latency metrics showed:
P99 latency:
Before: 2.0μs
After: 3.5μs (+75%!)
Immediate rollback and investigation revealed a bad optimization. Without metrics, users would have complained (maybe).
Example 3: Capacity Planning
Traffic metrics showed growth trend:
Packets per second:
Jan: 2.1M
Feb: 2.3M (+9%)
Mar: 2.5M (+9%)
Projected: 3.2M by June
This data justified hardware capacity planning six months in advance.
Best Practices
From two years of building observability:
1. Instrument from Day One
Don’t add observability after the fact. Build it in from the start.
2. Make Metrics Cheap
If metrics add >1% overhead, you’ll disable them in production. Keep them lightweight.
3. Sample Aggressively
You don’t need every log line. Sample debug logs, full traces, and verbose events.
4. Store Metrics, Not Logs
Metrics compress better and query faster than logs. Use logs for context, metrics for trends.
5. Alert on Symptoms, Not Causes
Alert when users are impacted (high latency, errors), not on proxies (CPU usage).
6. Reduce Alert Fatigue
False positives destroy trust. Tune thresholds carefully and use alert escalation.
Conclusion
Observability is not optional for production systems. You cannot fix what you cannot see. The investment in comprehensive monitoring pays for itself the first time you debug a production issue in minutes instead of days.
The three pillars work together:
- Metrics tell you what is wrong
- Logs tell you why it’s wrong
- Traces tell you where it’s wrong
Build all three into your systems from the beginning. Your future self will thank you.
In FC-Redirect, observability has:
- Reduced mean time to detection from hours to seconds
- Reduced mean time to resolution from days to minutes
- Enabled proactive capacity planning
- Prevented outages through early warning
Observability is not a feature. It’s a requirement.