After debugging too many production issues with insufficient data, I’ve become obsessed with observability. The ability to understand what your system is doing in production is the difference between quick resolution and days of guesswork. Here’s how I built comprehensive observability into FC-Redirect.
The Three Pillars of Observability
Observability rests on three foundations:
- Metrics: Numerical measurements over time
- Logs: Discrete events with context
- Traces: Request flows through distributed components
Let me show how I implemented each in FC-Redirect.
Metrics: Measuring What Matters
Identifying Key Metrics
Not everything needs to be measured. I focus on:
The Four Golden Signals (from Google SRE):
- Latency: How long do operations take?
- Traffic: How many requests are we serving?
- Errors: What’s failing?
- Saturation: How full are we?
FC-Redirect-Specific Metrics:
typedef struct fc_redirect_metrics {
    // Latency
    histogram_t packet_latency;
    histogram_t flow_lookup_latency;
    histogram_t redirect_decision_latency;

    // Traffic
    atomic_uint64_t packets_per_second;
    atomic_uint64_t bytes_per_second;
    atomic_uint64_t flows_per_second;

    // Errors
    atomic_uint64_t lookup_failures;
    atomic_uint64_t redirect_failures;
    atomic_uint64_t queue_overflows;

    // Saturation
    atomic_uint64_t flow_table_utilization;  // 0-100%
    atomic_uint64_t cpu_utilization;         // 0-100%
    atomic_uint64_t memory_utilization;      // 0-100%
} fc_redirect_metrics_t;
Efficient Metric Collection
Metrics must be cheap to collect:
// Per-thread counters (no atomic operations!): one cache-line-aligned
// slot per worker thread, so writers never share a cache line.
typedef struct thread_counters {
    uint64_t packets_processed;
    uint64_t bytes_processed;
    uint64_t lookups_performed;
} __attribute__((aligned(64))) thread_counters_t;

static thread_counters_t thread_counters[MAX_THREADS];
static __thread int my_thread_id;  // assigned when the worker thread starts

static inline void record_packet_processed(size_t bytes) {
    thread_counters[my_thread_id].packets_processed++;
    thread_counters[my_thread_id].bytes_processed += bytes;
}

// Periodic aggregation (the reads are racy, which is acceptable for metrics)
void aggregate_metrics(void) {
    uint64_t total_packets = 0;
    uint64_t total_bytes = 0;

    for (int i = 0; i < num_threads; i++) {
        total_packets += thread_counters[i].packets_processed;
        total_bytes += thread_counters[i].bytes_processed;

        // Reset for next period
        thread_counters[i].packets_processed = 0;
        thread_counters[i].bytes_processed = 0;
    }

    // Export aggregated metrics
    export_metric("packets_per_second", total_packets);
    export_metric("bytes_per_second", total_bytes);
}
Per-thread counters eliminate atomic overhead. We aggregate periodically (every second) for export.
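As a minimal sketch of how that periodic aggregation is driven: the dedicated pthread-based housekeeping thread and the sleep(1) interval below are assumptions for illustration, not necessarily how FC-Redirect schedules its stats work.

#include <pthread.h>
#include <unistd.h>

// Housekeeping thread: wakes once per second, folds the per-thread
// counters into the exported metrics, then goes back to sleep.
static void *metrics_thread_main(void *arg) {
    (void)arg;
    for (;;) {
        sleep(1);             // one-second aggregation period
        aggregate_metrics();
    }
    return NULL;
}

// Called once during startup (error handling omitted for brevity).
void start_metrics_thread(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, metrics_thread_main, NULL);
}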
Latency Histograms
Averages lie. Always use histograms:
typedef struct histogram {
    atomic_uint64_t buckets[NUM_BUCKETS];
    uint64_t bucket_boundaries[NUM_BUCKETS];
} histogram_t;

void histogram_init(histogram_t *h) {
    // Exponential bucket boundaries in nanoseconds: 1ns, 2ns, 4ns, 8ns, ...
    for (int i = 0; i < NUM_BUCKETS; i++) {
        h->bucket_boundaries[i] = (1ULL << i);  // Powers of 2
        atomic_store(&h->buckets[i], 0);
    }
}

void histogram_record(histogram_t *h, uint64_t value_ns) {
    // Linear scan for the bucket (NUM_BUCKETS is small)
    int bucket = 0;
    while (bucket < NUM_BUCKETS - 1 &&
           value_ns >= h->bucket_boundaries[bucket + 1]) {
        bucket++;
    }
    atomic_fetch_add(&h->buckets[bucket], 1);
}

double histogram_percentile(histogram_t *h, double percentile) {
    uint64_t total = 0;
    for (int i = 0; i < NUM_BUCKETS; i++) {
        total += atomic_load(&h->buckets[i]);
    }
    if (total == 0) {
        return 0.0;
    }

    uint64_t target = (uint64_t)(total * percentile / 100.0);
    uint64_t cumulative = 0;
    for (int i = 0; i < NUM_BUCKETS; i++) {
        cumulative += atomic_load(&h->buckets[i]);
        if (cumulative >= target) {
            return h->bucket_boundaries[i];
        }
    }
    return h->bucket_boundaries[NUM_BUCKETS - 1];
}
Now I can calculate P50, P99, P999 latencies accurately.
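For completeness, here is how the histogram is fed and queried in practice. The now_ns() helper, the global metrics instance, and the timed_flow_lookup wrapper are illustrative assumptions; only histogram_record, histogram_percentile, and lookup_flow come from the code above.

#include <time.h>

static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

// Time one flow lookup and record the latency in nanoseconds.
void timed_flow_lookup(uint64_t flow_key) {
    uint64_t start = now_ns();
    flow_entry_t *flow = lookup_flow(flow_key);
    histogram_record(&metrics.flow_lookup_latency, now_ns() - start);
    (void)flow;  // result handling omitted; this sketch only shows the timing
}

// Later, when exporting:
//   double p50  = histogram_percentile(&metrics.flow_lookup_latency, 50.0);
//   double p99  = histogram_percentile(&metrics.flow_lookup_latency, 99.0);
//   double p999 = histogram_percentile(&metrics.flow_lookup_latency, 99.9);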
Metric Export
I export metrics via a simple HTTP endpoint:
void handle_metrics_request(http_request_t *req) {
    char *response = malloc(64 * 1024);
    if (!response) {
        return;
    }
    size_t offset = 0;

    // Prometheus exposition format
    offset += sprintf(response + offset,
        "# HELP packets_per_second Packet processing rate\n");
    offset += sprintf(response + offset,
        "# TYPE packets_per_second gauge\n");
    offset += sprintf(response + offset,
        "packets_per_second %lu\n",
        get_packets_per_second());

    offset += sprintf(response + offset,
        "# HELP flow_lookup_latency_p99 P99 flow lookup latency (ns)\n");
    offset += sprintf(response + offset,
        "# TYPE flow_lookup_latency_p99 gauge\n");
    offset += sprintf(response + offset,
        "flow_lookup_latency_p99 %.0f\n",
        histogram_percentile(&latency_histogram, 99.0));

    // ... more metrics ...

    http_send_response(req, response, offset);
    free(response);
}
This integrates seamlessly with Prometheus for scraping and Grafana for visualization.
Logs: Context for Understanding
Structured Logging
I use structured logging for machine-parseable logs:
typedef enum {
    LOG_DEBUG,
    LOG_INFO,
    LOG_WARN,
    LOG_ERROR,
    LOG_CRITICAL
} log_level_t;

void log_structured(log_level_t level, const char *event,
                    const char *key1, const char *val1, ...) {
    // Build the whole line in a local buffer so concurrent threads cannot
    // interleave fields, then emit it with a single write.
    char line[1024];
    size_t off = 0;

    // JSON format for structured logging
    off += snprintf(line + off, sizeof(line) - off,
                    "{\"timestamp\":\"%lu\",\"level\":\"%s\",\"event\":\"%s\"",
                    time_microseconds(), log_level_str(level), event);

    va_list args;
    va_start(args, val1);
    const char *key = key1;
    const char *val = val1;
    while (key != NULL && off < sizeof(line)) {
        off += snprintf(line + off, sizeof(line) - off,
                        ",\"%s\":\"%s\"", key, val);
        key = va_arg(args, const char *);
        if (key) val = va_arg(args, const char *);
    }
    va_end(args);

    fprintf(stderr, "%s}\n", line);
}

// Usage
log_structured(LOG_INFO, "flow_created",
               "flow_key", flow_key_str,
               "source_wwpn", src_wwpn_str,
               "dest_wwpn", dst_wwpn_str,
               NULL);
Structured logs can be indexed and searched efficiently.
Log Levels and Sampling
Not everything should be logged at INFO:
#define LOG_DEBUG_SAMPLED(fmt, ...) \
    do { \
        if (should_sample_debug_log()) { \
            log_debug(fmt, ##__VA_ARGS__); \
        } \
    } while (0)

static bool should_sample_debug_log(void) {
    static atomic_uint64_t counter = 0;
    uint64_t val = atomic_fetch_add(&counter, 1);
    // Log 1 in 1000 debug events
    return (val % 1000) == 0;
}

// Usage in hot path
LOG_DEBUG_SAMPLED("Processing packet flow=%016lx size=%u",
                  flow_key, packet->size);
Sampling prevents log volume from overwhelming the system while still providing visibility.
Ring Buffer Logging
For post-mortem debugging, I keep a ring buffer of recent events:
#define RING_LOG_SIZE (1024 * 1024)

typedef struct ring_log_entry {
    uint64_t timestamp;
    uint32_t thread_id;
    char message[256];
} ring_log_entry_t;

typedef struct ring_log {
    ring_log_entry_t entries[RING_LOG_SIZE];
    atomic_uint64_t index;
} ring_log_t;

ring_log_t global_ring_log = {0};

void ring_log(const char *fmt, ...) {
    uint64_t idx = atomic_fetch_add(&global_ring_log.index, 1);
    ring_log_entry_t *entry = &global_ring_log.entries[idx % RING_LOG_SIZE];

    entry->timestamp = rdtsc();
    entry->thread_id = gettid();

    va_list args;
    va_start(args, fmt);
    vsnprintf(entry->message, sizeof(entry->message), fmt, args);
    va_end(args);
}

// Dump the ring log on crash. fprintf is not async-signal-safe, but we are
// about to abort anyway, so the tradeoff is acceptable here.
void signal_handler(int sig) {
    (void)sig;
    fprintf(stderr, "=== CRASH RING LOG DUMP ===\n");

    uint64_t current = atomic_load(&global_ring_log.index);
    uint64_t start = current > RING_LOG_SIZE ? current - RING_LOG_SIZE : 0;

    for (uint64_t i = start; i < current; i++) {
        ring_log_entry_t *entry =
            &global_ring_log.entries[i % RING_LOG_SIZE];
        fprintf(stderr, "[%lu] tid=%u: %s\n",
                entry->timestamp, entry->thread_id, entry->message);
    }
    abort();
}
When crashes occur, we dump the last million log entries from the ring buffer.
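The handler also has to be registered for the fatal signals. A minimal sketch of that wiring follows; the exact signal set and the SA_RESETHAND choice are assumptions about how FC-Redirect installs it.

#include <signal.h>
#include <string.h>

// Register the crash handler. SIGABRT is left at its default disposition
// because signal_handler() itself ends in abort().
void install_crash_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;  // one shot: a fault inside the handler cannot loop

    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);
    sigaction(SIGILL, &sa, NULL);
    sigaction(SIGFPE, &sa, NULL);
}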
Traces: Following Request Flows
Distributed Tracing
For distributed systems, I implement request tracing:
typedef struct trace_context {
    uint64_t trace_id;
    uint64_t span_id;
    uint64_t parent_span_id;
    uint64_t start_time;
} trace_context_t;

// Start a new trace
trace_context_t trace_start(const char *operation) {
    trace_context_t ctx = {
        .trace_id = generate_trace_id(),
        .span_id = generate_span_id(),
        .parent_span_id = 0,
        .start_time = time_microseconds()
    };
    log_structured(LOG_DEBUG, "trace_start",
                   "trace_id", trace_id_str(ctx.trace_id),
                   "span_id", span_id_str(ctx.span_id),
                   "operation", operation,
                   NULL);
    return ctx;
}

// Create child span
trace_context_t trace_child_span(trace_context_t *parent,
                                 const char *operation) {
    trace_context_t ctx = {
        .trace_id = parent->trace_id,
        .span_id = generate_span_id(),
        .parent_span_id = parent->span_id,
        .start_time = time_microseconds()
    };
    log_structured(LOG_DEBUG, "span_start",
                   "trace_id", trace_id_str(ctx.trace_id),
                   "span_id", span_id_str(ctx.span_id),
                   "parent_span_id", span_id_str(ctx.parent_span_id),
                   "operation", operation,
                   NULL);
    return ctx;
}

// End span
void trace_end(trace_context_t *ctx) {
    uint64_t duration = time_microseconds() - ctx->start_time;
    log_structured(LOG_DEBUG, "span_end",
                   "trace_id", trace_id_str(ctx->trace_id),
                   "span_id", span_id_str(ctx->span_id),
                   "duration_us", uint64_str(duration),
                   NULL);
}

// Usage
void process_request(packet_t *pkt) {
    trace_context_t trace = trace_start("process_packet");

    trace_context_t lookup_span = trace_child_span(&trace, "flow_lookup");
    flow_entry_t *flow = lookup_flow(pkt->flow_key);
    trace_end(&lookup_span);

    trace_context_t policy_span = trace_child_span(&trace, "apply_policy");
    apply_policy(pkt, flow);
    trace_end(&policy_span);

    trace_end(&trace);
}
This generates traces I can visualize in tools like Jaeger or Zipkin.
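In-process spans are only half of the story: to follow a request across distributed components, the trace context has to travel with it. Below is a sketch of propagating the IDs in a small wire header; the fc_trace_header_t layout and the trace_inject/trace_extract names are assumptions for illustration, reusing the trace API defined above.

// Hypothetical on-the-wire header carrying the trace context between
// components. Packed so both sides agree on the layout.
typedef struct __attribute__((packed)) fc_trace_header {
    uint64_t trace_id;
    uint64_t parent_span_id;
} fc_trace_header_t;

// Sender side: embed the current context before forwarding.
void trace_inject(const trace_context_t *ctx, fc_trace_header_t *hdr) {
    hdr->trace_id = ctx->trace_id;
    hdr->parent_span_id = ctx->span_id;  // our span becomes the child's parent
}

// Receiver side: continue the same trace with a new span.
trace_context_t trace_extract(const fc_trace_header_t *hdr,
                              const char *operation) {
    trace_context_t ctx = {
        .trace_id = hdr->trace_id,
        .span_id = generate_span_id(),
        .parent_span_id = hdr->parent_span_id,
        .start_time = time_microseconds()
    };
    log_structured(LOG_DEBUG, "span_start",
                   "trace_id", trace_id_str(ctx.trace_id),
                   "span_id", span_id_str(ctx.span_id),
                   "parent_span_id", span_id_str(ctx.parent_span_id),
                   "operation", operation,
                   NULL);
    return ctx;
}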
Alerting: Turning Data into Action
Metrics are useless without alerts:
typedef struct alert_rule {
    const char *name;
    metric_getter_fn get_metric;
    alert_condition_t condition;
    double threshold;
    alert_severity_t severity;
} alert_rule_t;

alert_rule_t alert_rules[] = {
    {
        .name = "High Packet Loss",
        .get_metric = get_packet_loss_rate,
        .condition = CONDITION_GREATER_THAN,
        .threshold = 0.01,   // 1%
        .severity = SEVERITY_CRITICAL
    },
    {
        .name = "Flow Table Near Full",
        .get_metric = get_flow_table_utilization,
        .condition = CONDITION_GREATER_THAN,
        .threshold = 0.85,   // 85%
        .severity = SEVERITY_WARNING
    },
    // ... more rules ...
};

static const int num_alert_rules = sizeof(alert_rules) / sizeof(alert_rules[0]);

void evaluate_alerts(void) {
    for (int i = 0; i < num_alert_rules; i++) {
        alert_rule_t *rule = &alert_rules[i];
        double value = rule->get_metric();

        bool triggered = false;
        switch (rule->condition) {
        case CONDITION_GREATER_THAN:
            triggered = value > rule->threshold;
            break;
        case CONDITION_LESS_THAN:
            triggered = value < rule->threshold;
            break;
        }

        if (triggered) {
            fire_alert(rule->name, value, rule->severity);
        }
    }
}
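The rules only decide when a threshold is crossed; fire_alert itself should deduplicate, otherwise a sustained condition pages someone on every evaluation cycle. Here is a sketch with a simple cooldown; the double_str and send_alert_notification helpers are assumed hooks, not part of the code above.

#include <string.h>
#include <time.h>

#define ALERT_COOLDOWN_SECONDS 300   // suppress repeats for 5 minutes
#define MAX_TRACKED_ALERTS     64

// Remember when each alert last notified so a sustained condition
// does not page on every evaluation cycle.
static struct {
    const char *name;
    time_t last_sent;
} alert_state[MAX_TRACKED_ALERTS];

void fire_alert(const char *name, double value, alert_severity_t severity) {
    time_t now = time(NULL);

    for (int i = 0; i < MAX_TRACKED_ALERTS; i++) {
        if (alert_state[i].name == NULL) {
            alert_state[i].name = name;       // first time we see this alert
            alert_state[i].last_sent = 0;
        }
        if (strcmp(alert_state[i].name, name) == 0) {
            if (now - alert_state[i].last_sent < ALERT_COOLDOWN_SECONDS) {
                return;                       // still in cooldown, stay quiet
            }
            alert_state[i].last_sent = now;
            break;
        }
    }

    log_structured(LOG_WARN, "alert_fired",
                   "alert", name,
                   "value", double_str(value),
                   NULL);
    send_alert_notification(name, value, severity);  // pager/webhook hook
}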
Dashboard: Visualizing System Health
I created a real-time dashboard using Grafana:
Panel 1: Traffic
- Packets per second (line graph)
- Bytes per second (line graph)
- Top 10 flows by bandwidth (bar chart)
Panel 2: Latency
- P50, P99, P999 latencies (line graphs)
- Latency histogram (heatmap)
Panel 3: Errors
- Error rate (line graph)
- Error types (pie chart)
- Recent errors (table)
Panel 4: Resources
- CPU utilization (gauge)
- Memory utilization (gauge)
- Flow table utilization (gauge)
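The latency heatmap works best when the exporter publishes the raw histogram rather than precomputed percentiles, because Grafana can render cumulative bucket series directly. Below is a sketch of emitting the histogram_t from earlier in Prometheus's cumulative _bucket{le="..."} text format; the export_histogram name is an assumption, buffer-overflow handling is deliberately omitted, and the _sum series is skipped because histogram_t does not track a sum.

// Append one histogram in Prometheus text format (cumulative buckets).
size_t export_histogram(char *buf, size_t cap, const char *name, histogram_t *h) {
    size_t off = 0;
    uint64_t cumulative = 0;

    off += snprintf(buf + off, cap - off, "# TYPE %s histogram\n", name);

    for (int i = 0; i < NUM_BUCKETS; i++) {
        cumulative += atomic_load(&h->buckets[i]);
        if (i < NUM_BUCKETS - 1) {
            // The upper bound of bucket i is the next bucket's lower boundary.
            off += snprintf(buf + off, cap - off,
                            "%s_bucket{le=\"%lu\"} %lu\n",
                            name, h->bucket_boundaries[i + 1], cumulative);
        } else {
            off += snprintf(buf + off, cap - off,
                            "%s_bucket{le=\"+Inf\"} %lu\n", name, cumulative);
        }
    }
    off += snprintf(buf + off, cap - off, "%s_count %lu\n", name, cumulative);
    return off;
}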
Real-World Examples
Example 1: Memory Leak Detection
Monitoring revealed gradual memory growth:
Memory utilization over 7 days:
Day 0: 45%
Day 1: 48%
Day 2: 51%
Day 3: 54%
...
Without monitoring, this would eventually have ended in an out-of-memory kill. With monitoring, I caught and fixed the leak well before that point.
Example 2: Performance Regression
After a deployment, latency metrics showed:
P99 latency:
Before: 2.0μs
After: 3.5μs (+75%!)
Immediate rollback and investigation revealed a bad optimization. Without metrics, users would have complained (maybe).
Example 3: Capacity Planning
Traffic metrics showed growth trend:
Packets per second:
Jan: 2.1M
Feb: 2.3M (+9%)
Mar: 2.5M (+9%)
Projected: 3.2M by June
This data justified hardware capacity planning six months in advance.
Best Practices
From two years of building observability:
1. Instrument from Day One
Don’t add observability after the fact. Build it in from the start.
2. Make Metrics Cheap
If metrics add >1% overhead, you’ll disable them in production. Keep them lightweight.
3. Sample Aggressively
You don’t need every log line. Sample debug logs, full traces, and verbose events.
4. Store Metrics, Not Logs
Metrics compress better and query faster than logs. Use logs for context, metrics for trends.
5. Alert on Symptoms, Not Causes
Alert when users are impacted (high latency, errors), not on proxies (CPU usage).
6. Reduce Alert Fatigue
False positives destroy trust. Tune thresholds carefully and use alert escalation.
Conclusion
Observability is not optional for production systems. You cannot fix what you cannot see. The investment in comprehensive monitoring pays for itself the first time you debug a production issue in minutes instead of days.
The three pillars work together:
- Metrics tell you what is wrong
- Logs tell you why it’s wrong
- Traces tell you where it’s wrong
Build all three into your systems from the beginning. Your future self will thank you.
In FC-Redirect, observability has:
- Reduced mean time to detection from hours to seconds
- Reduced mean time to resolution from days to minutes
- Enabled proactive capacity planning
- Prevented outages through early warning
Observability is not a feature. It’s a requirement.