Last month, I spent three weeks hunting one of the most elusive bugs I’ve ever encountered. It only appeared in production, disappeared when we tried to reproduce it, and seemed to violate the laws of physics. This is the story of how I finally tracked it down.
The Mysterious Flow Corruption
It started with a customer ticket. A large financial services customer running FC-Redirect on their MDS 9250i reported occasional flow corruption. About once every 24-48 hours, a few flows would suddenly report impossible statistics: negative byte counts, packet counts jumping by billions, or destination WWPNs changing mid-flow.
The corruption always self-corrected within seconds, but it was triggering their monitoring alerts and creating audit trail issues. Worse, it made them question the reliability of our entire solution.
The Heisenbug Emerges
I started with the standard debugging approach: trying to reproduce the issue in our lab. I set up an identical configuration, replayed their traffic patterns, even tried to match their exact firmware versions and hardware configuration.
Nothing. The system ran flawlessly for weeks.
Then I tried the classic “add logging” approach. I instrumented every flow update with detailed logging and asked the customer to enable it. But here’s where things got weird: with logging enabled, the corruption never occurred. The bug literally disappeared when we tried to observe it. A textbook Heisenbug.
This suggested a timing-sensitive race condition. The overhead of logging was changing the timing just enough to mask the issue.
Building Better Observability
Since I couldn’t reproduce it directly, I needed better observability that wouldn’t affect timing. I built a lightweight tracing system using lock-free ring buffers:
#define TRACE_BUFFER_SIZE (1024 * 1024)

typedef struct trace_entry {
    uint64_t timestamp;      /* CPU timestamp counter at the time of the update */
    uint32_t thread_id;
    uint32_t event_type;
    wwpn_t   flow_key;
    uint64_t old_value;
    uint64_t new_value;
} trace_entry_t;

typedef struct trace_buffer {
    trace_entry_t entries[TRACE_BUFFER_SIZE];
    atomic_uint64_t write_index;
} trace_buffer_t;

/* Single shared ring; write_index is the only point of coordination. */
static trace_buffer_t trace_buffer;

static inline void trace_flow_update(wwpn_t flow, uint64_t old_val,
                                     uint64_t new_val, uint32_t event) {
    /* Claim a slot with one atomic increment; the buffer wraps around. */
    uint64_t index = atomic_fetch_add(&trace_buffer.write_index, 1);
    trace_entry_t *entry = &trace_buffer.entries[index % TRACE_BUFFER_SIZE];

    entry->timestamp  = rdtsc();
    entry->thread_id  = get_thread_id();
    entry->event_type = event;
    entry->flow_key   = flow;
    entry->old_value  = old_val;
    entry->new_value  = new_val;
}
This tracing added only about 50 nanoseconds per operation and didn’t require any locks. When the corruption occurred, we could dump the ring buffer and see exactly what happened in the seconds before.
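The dump side is not shown above; roughly, it just walks the ring in write order. The sketch below is illustrative rather than the production tool (it assumes <stdio.h> and the project's atomic wrappers, and it skips printing the WWPN because wwpn_t's layout is project-specific):

/* Best-effort dump: entries still being written may be partially filled. */
void dump_trace_buffer(const trace_buffer_t *tb) {
    uint64_t end   = atomic_load(&tb->write_index);
    uint64_t start = (end > TRACE_BUFFER_SIZE) ? end - TRACE_BUFFER_SIZE : 0;

    for (uint64_t i = start; i < end; i++) {
        const trace_entry_t *e = &tb->entries[i % TRACE_BUFFER_SIZE];
        printf("%llu thread=%u event=%u old=%llu new=%llu\n",
               (unsigned long long)e->timestamp,
               e->thread_id, e->event_type,
               (unsigned long long)e->old_value,
               (unsigned long long)e->new_value);
    }
}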
The Smoking Gun
Two days after deploying the instrumented version, the customer hit the issue again. I got a 3 AM call and immediately dumped the trace buffer. What I saw was bizarre:
Thread 4: flow=20:00:00:25:b5:00:00:0f update bytes 1234567890 -> 1234568890
Thread 7: flow=20:00:00:25:b5:00:00:0f update bytes 1234568890 -> 1234569890
Thread 4: flow=20:00:00:25:b5:00:00:0f update bytes 1234567890 -> 1234570890
Thread 7: flow=20:00:00:25:b5:00:00:0f read bytes = 18446744073709550616 (negative!)
Thread 4’s second update used a stale old_value (1234567890 instead of 1234569890), creating an inconsistent state. This pointed to a read-modify-write race condition, but we used atomic operations everywhere. How was this possible?
The Subtle Race Condition
I stared at the flow update code for hours:
void update_flow_bytes(flow_entry_t *flow, uint32_t new_bytes) {
    // Read current value
    uint64_t current = atomic_load(&flow->bytes);

    // Calculate new value
    uint64_t updated = current + new_bytes;

    // Write back (THIS IS THE BUG!)
    atomic_store(&flow->bytes, updated);
}
The individual operations were atomic, but the read-modify-write sequence wasn’t! Between the load and store, another thread could modify the value, and we’d overwrite it with a stale calculation.
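To see the lost update concretely, consider a hypothetical interleaving of two threads running the code above (the byte counts are invented for illustration):

Thread A: load  bytes = 1000
Thread B: load  bytes = 1000
Thread B: store bytes = 1500   (B added 500)
Thread A: store bytes = 1200   (A added 200; B's 500 bytes silently vanish)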
The correct implementation required compare-and-swap:
void update_flow_bytes_fixed(flow_entry_t *flow, uint32_t new_bytes) {
    uint64_t current, updated;
    do {
        current = atomic_load(&flow->bytes);
        updated = current + new_bytes;
    } while (!atomic_compare_exchange_weak(&flow->bytes, &current, updated));
}
But wait, we already used CAS in some code paths. Why didn’t we use it everywhere?
The Optimization That Broke Things
Digging through the git history, I found the culprit. Six months ago, a well-meaning engineer had “optimized” some hot paths by replacing CAS loops with simple atomic stores, assuming they were good enough. The commit message said: “Optimization: replace expensive CAS with faster atomic store in update paths.”
The change passed all our tests because our test workloads weren’t stressing the race condition window. In production, with 12K flows and hundreds of thousands of updates per second across 8 cores, the race window was getting hit multiple times per day.
This is why premature optimization is dangerous. The engineer had “optimized” correct code into incorrect code, saving maybe 10 nanoseconds per operation while introducing subtle corruption.
The Fix and Validation
The fix was straightforward: revert to CAS-based updates everywhere:
// Update with retry loop
static inline void atomic_add_uint64(atomic_uint64_t *var, uint64_t delta) {
    uint64_t current, updated;
    do {
        current = atomic_load(var);
        updated = current + delta;
    } while (!atomic_compare_exchange_weak(var, &current, updated));
}

void update_flow_stats(flow_entry_t *flow, uint32_t packets, uint32_t bytes) {
    atomic_add_uint64(&flow->packets, packets);
    atomic_add_uint64(&flow->bytes, bytes);
}
But how to validate the fix? I couldn’t wait weeks for the customer to report whether the issue was gone. I needed a way to stress-test the race condition.
I built a targeted test that spawned 16 threads all hammering the same flow entry with updates:
void stress_test_concurrent_updates(void) {
    flow_entry_t flow = {0};
    atomic_uint64_t expected = 0;

    // 16 threads hammer the same flow entry concurrently
    #pragma omp parallel for num_threads(16)
    for (int i = 0; i < 1000000; i++) {
        uint32_t delta = (i % 1000) + 1;
        update_flow_bytes(&flow, delta);      // flow-update path under test
        atomic_add_uint64(&expected, delta);  // known-correct reference counter
    }

    assert(atomic_load(&flow.bytes) == atomic_load(&expected));
}
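The test leans on OpenMP for the thread fan-out, so building it is a one-liner; the file name below is just for illustration, and it gets linked against the flow-update code under test:

gcc -O2 -fopenmp stress_test.c -o stress_test && ./stress_test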
With the broken code, this test failed within milliseconds. With the fix, it passed reliably even after billions of iterations.
Production Deployment and Resolution
After a week of stress testing, I deployed the fix to the customer. They ran it for 30 days with zero occurrences of the corruption. The bug was finally dead.
Lessons from the Trenches
This experience taught me several invaluable lessons:
1. Heisenbugs Reveal Timing Issues
When a bug disappears when you observe it, suspect race conditions or timing-dependent behavior. Don’t fight the phenomenon; use it as a diagnostic clue.
2. Build Observability That Doesn’t Perturb
Traditional logging often changes timing too much to catch timing bugs. Lock-free ring buffers and similar techniques let you observe without disrupting.
3. Atomicity Is Not Composable
Just because individual operations are atomic doesn’t mean sequences of them are. Read-modify-write requires explicit atomic RMW operations (CAS, fetch-add, etc.).
4. Stress Tests Beat Unit Tests for Concurrency
Traditional tests rarely catch race conditions. Targeted stress tests with many threads and iterations are essential for concurrent code validation.
5. Trust Your Customers
When a customer reports an issue you can’t reproduce, believe them. They’re running workloads and configurations you haven’t thought to test. Their production environment is your best test lab.
6. Document Why, Not Just What
The original “optimization” commit lacked explanation of why the change was safe. A comment explaining the concurrency assumptions would have prevented this bug.
The Broader Impact
This bug affected all versions of FC-Redirect for the past six months, but very few customers hit it because most didn’t run at sufficient scale. It’s a reminder that scale reveals issues that smaller deployments mask.
As we continue pushing toward higher scale throughout 2013, I’m reviewing all our atomic operations to ensure we haven’t made similar mistakes elsewhere. Concurrency is hard, and there are no shortcuts. Every shared variable needs careful analysis of its access patterns and appropriate synchronization.
Debugging is often seen as less glamorous than writing new features, but finding and fixing a bug like this is incredibly satisfying. The customer’s infrastructure is now more reliable, and our entire codebase is stronger. That’s a win worth celebrating.