Working on FC-Redirect means supporting customers running mission-critical storage infrastructure. When things go wrong, they go wrong at 3 AM, and the stakes are high. This year, I’ve debugged some fascinating production issues. Here are the most instructive ones.
Case 1: The Mysterious Throughput Degradation
The Problem
A financial services customer reported gradual throughput degradation over 72 hours. Their FC fabric would start at normal performance (2M IOPS) and slowly degrade to about 60% of that baseline. Rebooting fixed it temporarily, but the degradation would recur.
The Investigation
Initial theories:
- Memory leak (gradual resource exhaustion)
- Queue buildup (backpressure somewhere)
- External interference (competing workload)
I collected data remotely:
# Memory usage over time
while true; do
echo "$(date),$(free -m | grep Mem | awk '{print $3}')" >> memory.log
sleep 60
done
# Flow table statistics
while true; do
echo "$(date),$(cli show flow-table stats)" >> flows.log
sleep 60
done
# CPU utilization
sar -u 60 > cpu.log &
The data revealed something strange: memory usage was stable, CPU usage was stable, but flow lookup times were gradually increasing.
The Root Cause
I analyzed the flow table dump and noticed something odd:
Hour 0: 12,487 flows, avg lookup: 850ns
Hour 24: 12,491 flows, avg lookup: 1,243ns
Hour 48: 12,489 flows, avg lookup: 1,687ns
Hour 72: 12,492 flows, avg lookup: 2,103ns
Flow count was constant, but lookup time doubled. This pointed to hash table degradation.
Examining the flow table structure:
void analyze_hash_table_quality(flow_table_t *table) {
uint32_t empty_buckets = 0;
uint32_t max_chain_length = 0;
histogram_t chain_length_dist = {0};
for (int i = 0; i < table->capacity; i++) {
uint32_t chain_length = 0;
flow_entry_t *entry = &table->entries[i];
while (entry != NULL && entry->in_use) {
chain_length++;
entry = entry->next;
}
if (chain_length == 0) {
empty_buckets++;
} else {
histogram_add(&chain_length_dist, chain_length);
if (chain_length > max_chain_length) {
max_chain_length = chain_length;
}
}
}
printf("Empty buckets: %u / %u (%.2f%%)\n",
empty_buckets, table->capacity,
100.0 * empty_buckets / table->capacity);
printf("Max chain length: %u\n", max_chain_length);
printf("Chain length distribution:\n");
print_histogram(&chain_length_dist);
}
The output was revealing:
Hour 0:
Empty buckets: 3,247 / 16,384 (19.8%)
Max chain length: 4
Hour 72:
Empty buckets: 3,241 / 16,384 (19.8%)
Max chain length: 847 <-- This is the problem!
Over time, one hash bucket had accumulated 847 entries! Every lookup probing that bucket would traverse hundreds of entries.
Why Did This Happen?
The customer was running a specific workload pattern:
- Flows created with sequential WWPNs (storage array LUN provisioning)
- Our hash function had poor distribution for sequential inputs
- New flows kept hashing to the same bucket
- Old flows in that bucket weren’t timing out (persistent connections)
The hash function:
// Original hash function (bad for sequential WWPNs)
uint32_t wwpn_hash_bad(wwpn_t wwpn) {
return (uint32_t)(wwpn & 0xFFFFFFFF); // Just use lower 32 bits
}
// Sequential WWPNs from LUN provisioning:
// 20:00:00:25:b5:00:00:01
// 20:00:00:25:b5:00:00:02
// 20:00:00:25:b5:00:00:03
// All hash to nearly the same value. Once the hash is reduced modulo the
// table size, WWPNs that differ only in the higher bytes land in the
// same handful of buckets.
The Fix
I implemented a better hash function with good avalanche properties (the 64-bit finalizer step from MurmurHash3):
uint32_t wwpn_hash_improved(wwpn_t wwpn) {
// Mix all bits thoroughly
uint64_t h = wwpn;
h ^= (h >> 33);
h *= 0xff51afd7ed558ccdULL;
h ^= (h >> 33);
h *= 0xc4ceb9fe1a85ec53ULL;
h ^= (h >> 33);
return (uint32_t)h;
}
Testing with the customer’s WWPN patterns:
Old hash function:
Bucket 0x0001: 847 entries
Bucket 0x0002: 734 entries
Bucket 0x0003: 612 entries
Max chain: 847
New hash function:
Bucket 0x4a3f: 2 entries
Bucket 0x8b21: 3 entries
Bucket 0xc704: 1 entry
Max chain: 4
Near-uniform distribution! We deployed the fix, and throughput remained stable.
Lesson Learned
Hash functions must be tested with real-world data patterns, not just random data. Sequential patterns are common in storage (LUN provisioning, incremental backups) and can expose weak hash functions.
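A distribution test along these lines is cheap enough to keep around. Here's a minimal, self-contained sketch of the idea; the WWPN pattern is a simplified stand-in for the customer's provisioning scheme (several arrays, each creating sequentially numbered WWPNs), and TABLE_SIZE, NUM_ARRAYS, LUNS_PER_ARRAY, plus the assumption that the bucket index is simply the hash modulo the table size, are all illustrative:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE      16384   // same bucket count as the production table
#define NUM_ARRAYS      100     // illustrative: 100 arrays...
#define LUNS_PER_ARRAY  120     // ...each provisioning 120 sequential WWPNs

typedef uint64_t wwpn_t;

// Original hash: just the lower 32 bits
static uint32_t wwpn_hash_bad(wwpn_t wwpn) {
    return (uint32_t)(wwpn & 0xFFFFFFFF);
}

// Improved hash: 64-bit finalizer with good avalanche behavior
static uint32_t wwpn_hash_improved(wwpn_t wwpn) {
    uint64_t h = wwpn;
    h ^= (h >> 33);
    h *= 0xff51afd7ed558ccdULL;
    h ^= (h >> 33);
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= (h >> 33);
    return (uint32_t)h;
}

// Feed a provisioning-like WWPN pattern through a hash and report the
// worst-case chain length, mirroring the analyzer used in production.
static uint32_t max_chain(uint32_t (*hash)(wwpn_t)) {
    static uint32_t buckets[TABLE_SIZE];
    uint32_t worst = 0;
    memset(buckets, 0, sizeof(buckets));
    for (uint64_t a = 0; a < NUM_ARRAYS; a++) {
        for (uint64_t lun = 0; lun < LUNS_PER_ARRAY; lun++) {
            // Per-array byte above the low bits, sequential LUN number below
            wwpn_t wwpn = 0x20000025b5000000ULL | (a << 16) | lun;
            uint32_t b = hash(wwpn) % TABLE_SIZE;
            if (++buckets[b] > worst) {
                worst = buckets[b];
            }
        }
    }
    return worst;
}

int main(void) {
    printf("bad hash, max chain:      %u\n", max_chain(wwpn_hash_bad));
    printf("improved hash, max chain: %u\n", max_chain(wwpn_hash_improved));
    return 0;
}
With this stand-in pattern the old hash piles roughly a hundred entries into each of a few buckets, while the improved hash spreads them essentially uniformly.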
Case 2: The Intermittent Corruption
The Problem
A healthcare customer saw occasional flow statistics corruption: packet counts jumping by billions, byte counts going negative, or WWPNs changing mid-flow. It happened 1-2 times per week, always self-correcting within seconds.
The Investigation
This screamed “race condition,” but we’d carefully used atomic operations everywhere. I added lock-free tracing:
#define TRACE_BUFFER_SIZE (1024 * 1024)
typedef struct trace_entry {
uint64_t timestamp;
uint32_t thread_id;
uint32_t operation;
wwpn_t flow_key;
uint64_t old_value;
uint64_t new_value;
void *caller_address;
} trace_entry_t;
atomic_uint64_t trace_index = 0;
trace_entry_t trace_buffer[TRACE_BUFFER_SIZE];
static inline void trace_flow_operation(uint32_t op, wwpn_t key,
uint64_t old, uint64_t new) {
uint64_t idx = atomic_fetch_add(&trace_index, 1);
trace_entry_t *entry = &trace_buffer[idx % TRACE_BUFFER_SIZE];
entry->timestamp = rdtsc();
entry->thread_id = gettid();
entry->operation = op;
entry->flow_key = key;
entry->old_value = old;
entry->new_value = new;
entry->caller_address = __builtin_return_address(0);
}
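The dump side can be equally simple. A minimal sketch, reusing the trace_buffer, trace_index, and trace_entry_t definitions above (dump_trace_buffer and its output format are illustrative):
// Print the most recent num_entries trace records, oldest first.
// Entries may still be overwritten while we print; for post-mortem
// analysis that is acceptable.
void dump_trace_buffer(uint64_t num_entries) {
    uint64_t end = atomic_load(&trace_index);
    uint64_t start = (end > num_entries) ? end - num_entries : 0;
    for (uint64_t i = start; i < end; i++) {
        trace_entry_t *e = &trace_buffer[i % TRACE_BUFFER_SIZE];
        printf("tsc=%lu tid=%u op=%u key=%016lx %lu -> %lu caller=%p\n",
               e->timestamp, e->thread_id, e->operation, e->flow_key,
               e->old_value, e->new_value, e->caller_address);
    }
}
Because the writer only ever performs a single atomic_fetch_add, the tracing adds almost no overhead to the hot path, which matters when the bug you're chasing is timing-sensitive.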
When corruption occurred, I dumped the trace buffer and found:
Thread 4: FLOW_UPDATE bytes 1234567890 -> 1234568890
Thread 7: FLOW_UPDATE bytes 1234568890 -> 1234569890
Thread 4: FLOW_UPDATE bytes 1234567890 -> 1234570890 <-- Used stale value!
Thread 7: FLOW_READ bytes = 18446744073709550616 <-- Corruption!
Thread 4 had read the counter, Thread 7 updated it, and then Thread 4 wrote back a value computed from its stale read, wiping out Thread 7's update. This is a classic read-modify-write race.
The Root Cause
I examined the code:
void update_flow_bytes(flow_entry_t *flow, uint32_t new_bytes) {
uint64_t current = atomic_load(&flow->bytes); // Read
uint64_t updated = current + new_bytes; // Modify
atomic_store(&flow->bytes, updated); // Write (WRONG!)
}
The individual operations were atomic, but the RMW sequence wasn’t! This code existed in several “optimized” paths where an engineer had replaced compare-and-swap with load/store, believing it was “faster.”
The Fix
Replace all RMW sequences with atomic RMW operations:
void update_flow_bytes_correct(flow_entry_t *flow, uint32_t new_bytes) {
uint64_t current, updated;
do {
current = atomic_load(&flow->bytes);
updated = current + new_bytes;
} while (!atomic_compare_exchange_weak(&flow->bytes, &current, updated));
}
// Or even better, use fetch_add
void update_flow_bytes_best(flow_entry_t *flow, uint32_t new_bytes) {
atomic_fetch_add(&flow->bytes, new_bytes);
}
Lesson Learned
Atomicity is not composable. Just because individual operations are atomic doesn’t mean sequences are. Read-modify-write must be atomic as a whole, not piecemeal.
Also: “optimizations” that remove correctness guarantees are bugs, no matter how fast they are.
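The failure mode is easy to reproduce outside of production. Here's a small standalone program (not FC-Redirect code) in which two threads each increment a shared counter a million times; the load/store version loses updates, the fetch_add version does not:
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>

#define ITERATIONS 1000000

static _Atomic uint64_t counter = 0;

// Broken: load and store are each atomic, but the increment as a whole is not
static void *add_load_store(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        uint64_t current = atomic_load(&counter);   // Read
        atomic_store(&counter, current + 1);        // Write: another thread may
                                                    // have updated in between
    }
    return NULL;
}

// Correct: the read-modify-write is a single atomic operation
static void *add_fetch_add(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add(&counter, 1);
    }
    return NULL;
}

// Run the given increment function on two threads and return the final count
static uint64_t run(void *(*fn)(void *)) {
    pthread_t t1, t2;
    atomic_store(&counter, 0);
    pthread_create(&t1, NULL, fn, NULL);
    pthread_create(&t2, NULL, fn, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&counter);
}

int main(void) {
    printf("load/store: %lu (expected %d)\n", run(add_load_store), 2 * ITERATIONS);
    printf("fetch_add:  %lu (expected %d)\n", run(add_fetch_add), 2 * ITERATIONS);
    return 0;
}
Compile with -pthread and run it on a multi-core machine: the load/store total typically comes up well short of 2,000,000.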
Case 3: The Silent Data Loss
The Problem
An insurance company reported that some flow statistics were never appearing in their analytics database. Flows existed in our system, but some updates weren’t reaching their downstream systems.
The Investigation
This was subtle. No errors logged, no crashes, just occasional missing data. I added detailed logging to the replication path:
void replicate_flow_update(flow_update_t *update) {
log_debug("Replicating flow %016lx, seq=%lu",
update->flow_key, update->sequence_number);
for (int i = 0; i < num_replicas; i++) {
bool success = send_to_replica(&replicas[i], update);
log_debug("Replica %d: %s", i, success ? "OK" : "FAILED");
if (!success) {
enqueue_retry(update);
}
}
}
The logs showed that replication was succeeding. But the async retry queue was occasionally dropping updates:
bool enqueue_retry(flow_update_t *update) {
if (queue_full(&retry_queue)) {
log_warn("Retry queue full, dropping update"); // <-- Here!
return false;
}
enqueue(&retry_queue, update);
return true;
}
When the retry queue filled (during temporary network issues), we were silently dropping updates!
The Fix
Never silently drop data. Instead, implement backpressure:
bool enqueue_retry_with_backpressure(flow_update_t *update) {
while (queue_full(&retry_queue)) {
// Block until space available
log_warn("Retry queue full, waiting...");
usleep(10*1000); // 10ms
// Optional: shed low-priority updates
if (update->priority == PRIORITY_LOW) {
log_warn("Dropping low-priority update due to backpressure");
return false;
}
}
enqueue(&retry_queue, update);
return true;
}
Also added monitoring:
void monitor_retry_queue() {
uint32_t depth = queue_depth(&retry_queue);
uint32_t capacity = queue_capacity(&retry_queue);
if (depth > capacity * 0.8) {
alert("Retry queue 80% full", SEVERITY_WARNING);
}
if (depth >= capacity * 0.95) {
alert("Retry queue 95% full", SEVERITY_CRITICAL);
}
}
Lesson Learned
Silently dropping data is almost always wrong. If you must drop, make it loud (log, alert, metric). Better yet, implement backpressure to prevent drops.
Also: monitoring queue depths prevents silent disasters.
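When a drop genuinely cannot be avoided, the drop itself should be impossible to miss. A sketch of that pattern, reusing the retry-queue helpers from the code above (metric_increment() is a placeholder for whatever metrics facility is available, and the alert rate-limiting threshold is arbitrary):
static _Atomic uint64_t dropped_updates = 0;

bool enqueue_or_drop_loudly(flow_update_t *update) {
    if (queue_full(&retry_queue)) {
        // Count the drop, log it with the flow key, and surface it
        uint64_t total = atomic_fetch_add(&dropped_updates, 1) + 1;
        log_error("Dropping flow update %016lx (total drops: %lu)",
                  update->flow_key, total);
        metric_increment("retry_queue.drops");   // visible on a dashboard
        if (total % 1000 == 1) {                 // rate-limit the alerts
            alert("Retry queue dropping updates", SEVERITY_CRITICAL);
        }
        return false;
    }
    enqueue(&retry_queue, update);
    return true;
}
A per-drop log line plus a counter means the next "missing data" report starts with a graph instead of a mystery.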
Case 4: The Performance Cliff
The Problem
A retail customer experienced sudden 80% throughput drops when crossing 10,000 concurrent flows. Below 10K: normal performance. Above 10K: cliff.
The Investigation
This suggested a capacity limit or threshold. I looked for magic numbers in the code:
grep -r "10000\|10240\|0x2800" src/
Found nothing obvious. Then I profiled at different flow counts:
8,000 flows: avg CPU = 45%, avg latency = 2.1μs
10,000 flows: avg CPU = 52%, avg latency = 2.4μs
10,001 flows: avg CPU = 89%, avg latency = 11.8μs <-- Cliff!
12,000 flows: avg CPU = 94%, avg latency = 14.2μs
Something changed drastically at 10,001 flows. I examined the flow table resize logic:
#define INITIAL_CAPACITY 8192
#define LOAD_FACTOR_THRESHOLD 0.75
void maybe_resize_flow_table(flow_table_t *table) {
float load_factor = (float)table->count / table->capacity;
if (load_factor > LOAD_FACTOR_THRESHOLD) {
resize_flow_table(table, table->capacity * 2);
}
}
void resize_flow_table(flow_table_t *table, uint32_t new_capacity) {
// Allocate new table
flow_entry_t *new_entries = calloc(new_capacity, sizeof(flow_entry_t));
if (new_entries == NULL) {
log_error("Failed to allocate %zu bytes for flow table resize",
(size_t)new_capacity * sizeof(flow_entry_t));
return; // <-- fails without alerting; the table keeps running overloaded
}
// Rehash all entries
for (int i = 0; i < table->capacity; i++) {
if (table->entries[i].in_use) {
rehash_entry(&table->entries[i], new_entries, new_capacity);
}
}
// Swap tables
free(table->entries);
table->entries = new_entries;
table->capacity = new_capacity;
}
At 10,000 flows with capacity 8,192, load factor is 1.22, well over threshold. The table should resize to 16,384. But examining memory:
# Check actual table size
gdb -p $(pidof fc_redirect) -batch \
-ex 'p flow_table->capacity'
$1 = 8192 # Still 8K, not 16K!
The resize was failing! Checking logs:
[ERROR] Failed to allocate 1048576 bytes for flow table resize
The system was out of memory, so the resize failed. The only evidence was that single log line, with no alert, and the table kept operating at a 122% load factor with massive collision chains.
The Fix
Multiple fixes:
- Reserve memory for resize:
flow_entry_t *resize_buffer = NULL;
void init_flow_table() {
// Pre-allocate memory for future resizes, sized for the worst case.
// NOTE: for the buffer swap below to stay safe, the live table's entry
// array must also be MAX_CAPACITY-sized, so the two buffers can be
// exchanged back and forth without ever allocating on the resize path.
resize_buffer = malloc(MAX_CAPACITY * sizeof(flow_entry_t));
}
bool resize_flow_table_safe(flow_table_t *table, uint32_t new_capacity) {
if (resize_buffer == NULL) {
log_error("Resize failed: no buffer available");
return false;
}
if (new_capacity > MAX_CAPACITY) {
log_error("Resize failed: %u exceeds MAX_CAPACITY", new_capacity);
return false;
}
// Use pre-allocated buffer
memset(resize_buffer, 0, (size_t)new_capacity * sizeof(flow_entry_t));
// Rehash entries
for (int i = 0; i < table->capacity; i++) {
if (table->entries[i].in_use) {
rehash_entry(&table->entries[i], resize_buffer, new_capacity);
}
}
// Swap
flow_entry_t *old = table->entries;
table->entries = resize_buffer;
table->capacity = new_capacity;
resize_buffer = old; // Reuse the old buffer for the next resize
return true;
}
- Monitor resize failures:
void maybe_resize_flow_table_monitored(flow_table_t *table) {
float load_factor = (float)table->count / table->capacity;
if (load_factor > LOAD_FACTOR_THRESHOLD) {
if (resize_flow_table_safe(table, table->capacity * 2)) {
log_info("Flow table resized to %u", table->capacity);
} else {
log_error("Flow table resize FAILED, load=%.2f", load_factor);
alert("Flow table cannot resize", SEVERITY_CRITICAL);
}
}
}
- Graceful degradation:
if (load_factor > 1.5) {
// Critically overloaded, shed low-priority flows
evict_low_priority_flows(table, table->count * 0.1);
}
Lesson Learned
Always check return values, especially for allocations. Silent failures lead to mysterious performance cliffs.
Also: pre-allocate critical resources when possible.
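One pattern worth stating explicitly: reserve worst-case resources at startup and refuse to run without them. A sketch of the idea (reserve_or_die() is an illustrative helper, not FC-Redirect code; log_error() is the same logging call used in the snippets above):
#include <stdlib.h>
#include <string.h>

// Fail at boot, not at 3 AM: grab what the worst case needs during init
// and refuse to start if the allocation fails.
void *reserve_or_die(size_t bytes, const char *what) {
    void *p = malloc(bytes);
    if (p == NULL) {
        log_error("Startup reservation failed: %s (%zu bytes)", what, bytes);
        exit(1);   // failing fast here beats a silent performance cliff later
    }
    memset(p, 0, bytes);   // touch the pages so the memory is really committed
    return p;
}
The pre-allocated resize buffer above is this same idea applied to the flow table.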
Common Patterns
Across these issues, common patterns emerged:
- Race conditions appear under load: They're timing-dependent, so they emerge only at scale.
- Hash functions fail on real data: Test with actual customer data patterns.
- Silent failures are dangerous: If something fails, make noise.
- Memory allocation can fail: Check return values, pre-allocate when possible.
- Monitoring prevents surprises: Queue depths, error rates, resource utilization all need monitoring.
Debugging Toolkit
Tools that saved me:
- Detailed logging (with log levels to control verbosity)
- Lock-free tracing (for timing-sensitive bugs)
- Core dumps (configured to generate on crashes)
- Live debugging with gdb
- Metrics and monitoring (Prometheus + Grafana)
- Packet captures (tcpdump for protocol issues)
Conclusion
Production debugging is where theory meets reality. These issues taught me more than months of development. Each customer issue makes the product stronger and makes me a better engineer.
The key lessons:
- Never assume correctness; verify with tests and monitoring
- Real-world data patterns differ from synthetic tests
- Silent failures are the worst failures
- Monitoring and observability are not optional
As FC-Redirect continues growing through 2013, I’m sure more interesting issues will emerge. That’s the nature of distributed systems at scale. But each issue is an opportunity to learn and improve.
Thanks to all the customers who reported these issues and worked with me to debug them. Your production environments are the ultimate test lab.