There’s a saying in computer science, courtesy of Niklaus Wirth: “Algorithms + Data Structures = Programs.” After spending the last few months optimizing FC-Redirect, I’d add a corollary: “Data Structures = 90% of Performance.” Let me share how rethinking our data structures transformed our system’s performance.

The Hidden Cost of Poor Data Structure Choices

When I joined the FC-Redirect project, the codebase was functional and well-written, but it made some data structure choices that seemed reasonable at small scale but became problematic at large scale. The most impactful example was our flow tracking system.

The Original Implementation

Flow tracking used a straightforward linked list:

typedef struct flow_entry {
    wwpn_t source_wwpn;
    wwpn_t dest_wwpn;
    flow_state_t state;
    uint64_t packets;
    uint64_t bytes;
    struct flow_entry *next;
} flow_entry_t;

typedef struct flow_list {
    flow_entry_t *head;
    uint32_t count;
    pthread_mutex_t lock;
} flow_list_t;

At 1,000 flows, this works fine. Linear searches through 1,000 items are fast enough that nobody notices. But at 10,000 flows, we’re averaging 5,000 comparisons per lookup. At 100,000 lookups per second, that’s 500 million comparisons per second just to find flows.
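
The lookup itself is the obvious linear scan. A minimal sketch of its shape (locking and the exact match criteria are simplified here):

// Caller is expected to hold list->lock; matching simplified to source WWPN
static flow_entry_t* find_flow(flow_list_t *list, wwpn_t key) {
    // O(n): walk the list until the key matches (or we fall off the end)
    for (flow_entry_t *entry = list->head; entry != NULL; entry = entry->next) {
        if (entry->source_wwpn == key) {
            return entry;
        }
    }
    return NULL;
}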

Profiling Reveals the Truth

Before optimizing, I instrumented the code with high-resolution timing:

#include <stdint.h>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang

// Read the CPU's time-stamp counter for low-overhead cycle timing
static inline uint64_t rdtsc(void) {
    return __rdtsc();
}

uint64_t total_lookup_time = 0;   // accumulated lookup time, in TSC cycles
uint64_t lookup_count = 0;

flow_entry_t* find_flow_instrumented(flow_list_t *list, wwpn_t key) {
    uint64_t start = rdtsc();

    flow_entry_t *entry = find_flow(list, key);

    uint64_t end = rdtsc();
    total_lookup_time += (end - start);
    lookup_count++;

    return entry;
}
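
To turn those raw cycle counts into microseconds, divide by the TSC frequency. A small reporting helper along these lines (tsc_hz is an illustrative name for a frequency value calibrated at startup):

#include <stdio.h>

// 'tsc_hz' is an illustrative name for the TSC frequency, calibrated once
// at startup (e.g. against CLOCK_MONOTONIC); calibration code not shown.
extern uint64_t tsc_hz;

static void report_lookup_stats(void) {
    if (lookup_count == 0)
        return;

    double avg_cycles = (double)total_lookup_time / (double)lookup_count;
    double avg_us = avg_cycles * 1e6 / (double)tsc_hz;

    printf("flow lookups: %llu, average: %.1f us\n",
           (unsigned long long)lookup_count, avg_us);
}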

The profiling data was damning. At 12K flows:

  • Average lookup: 120 microseconds
  • P99 lookup: 340 microseconds
  • 58% of CPU time spent in flow lookups
  • 89% of lookup time spent traversing the linked list

This single function was the bottleneck limiting our entire system.

Hash Table Redesign

The solution was clear: replace the O(n) linked list with an O(1) hash table. But the implementation details mattered enormously.

Hash Function Selection

Fibre Channel WWPNs have specific structure and properties. I designed a hash function that leverages this:

// WWPN format: XX:XX:XX:XX:XX:XX:XX:XX (8 bytes)
// Vendor ID in first 3 bytes, device specific in last 5

static inline uint32_t wwpn_hash(wwpn_t wwpn) {
    // Mix all bits to break up vendor clustering
    // (this is the 64-bit finalizer from MurmurHash3)
    uint64_t h = wwpn;
    h ^= (h >> 33);
    h *= 0xff51afd7ed558ccdULL;
    h ^= (h >> 33);
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= (h >> 33);
    return (uint32_t)h;
}

This hash function provides excellent distribution even when WWPNs cluster by vendor (which they do in real deployments). Testing against real customer WWPN datasets showed less than 2% collision rates.
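
Measuring that is straightforward: replay a WWPN list through the hash and count how many keys land in already-occupied buckets. Something along these lines (dataset loading omitted; count_collisions is just an illustrative name):

#include <stdint.h>
#include <stdlib.h>

// Returns the fraction of keys that hash into an already-occupied bucket.
// 'keys' holds WWPNs loaded from a customer dataset (loading code omitted);
// 'capacity' must be a power of two, matching the real table sizing.
static double count_collisions(const wwpn_t *keys, size_t n, uint32_t capacity) {
    uint32_t mask = capacity - 1;
    uint8_t *occupied = calloc(capacity, sizeof(uint8_t));
    size_t collisions = 0;

    if (occupied == NULL)
        return -1.0;   // allocation failed

    for (size_t i = 0; i < n; i++) {
        uint32_t bucket = wwpn_hash(keys[i]) & mask;
        if (occupied[bucket])
            collisions++;          // another key already claimed this bucket
        else
            occupied[bucket] = 1;
    }

    free(occupied);
    return n ? (double)collisions / (double)n : 0.0;
}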

Open Addressing vs. Chaining

I tested both collision resolution strategies:

Chaining (linked lists at each bucket):

  • Pros: Simple, handles high load factors well
  • Cons: Poor cache locality, additional memory allocations

Open addressing (store entries directly in array):

  • Pros: Excellent cache locality, no allocations
  • Cons: Requires good hash function, performance degrades at high load

For our use case, open addressing with linear probing won decisively:

typedef struct flow_table {
    flow_entry_t *entries;  // Direct array of entries (each slot carries an in_use flag)
    uint32_t capacity;      // Power of 2
    uint32_t mask;          // capacity - 1
    uint32_t count;
    float load_factor;      // Trigger resize at 0.7
} flow_table_t;

flow_entry_t* flow_table_lookup(flow_table_t *table, wwpn_t key) {
    uint32_t hash = wwpn_hash(key);
    uint32_t index = hash & table->mask;

    // Linear probing; resizing at 0.7 load factor guarantees an empty
    // slot exists, so this loop always terminates
    while (table->entries[index].in_use) {
        if (table->entries[index].source_wwpn == key) {
            return &table->entries[index];
        }
        index = (index + 1) & table->mask;
    }

    return NULL;
}

The cache locality benefits were substantial. With open addressing, each probe hits adjacent memory, maximizing CPU cache utilization.
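
Insertion follows the same probe sequence, with a resize check against the load factor up front so the table never fills completely. A sketch of what that looks like (flow_table_resize, which reallocates at double capacity and rehashes every live entry, is assumed and not shown here):

#include <stdbool.h>

// Assumed helper: grow the table and rehash all live entries (not shown)
bool flow_table_resize(flow_table_t *table, uint32_t new_capacity);

static bool flow_table_insert(flow_table_t *table, const flow_entry_t *new_entry) {
    // Grow before crossing the 0.7 load-factor threshold so linear
    // probing is always guaranteed to find an empty slot
    if ((float)(table->count + 1) > table->load_factor * (float)table->capacity) {
        if (!flow_table_resize(table, table->capacity * 2))
            return false;
    }

    uint32_t index = wwpn_hash(new_entry->source_wwpn) & table->mask;

    // Probe until we hit a free slot or an existing entry with the same key
    while (table->entries[index].in_use) {
        if (table->entries[index].source_wwpn == new_entry->source_wwpn) {
            table->entries[index] = *new_entry;   // update existing flow in place
            table->entries[index].in_use = true;
            return true;
        }
        index = (index + 1) & table->mask;
    }

    table->entries[index] = *new_entry;
    table->entries[index].in_use = true;
    table->count++;
    return true;
}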

Memory Layout Optimization

Beyond algorithmic complexity, memory layout profoundly impacts performance. I restructured our flow entries for cache efficiency:

Before: Cache-Unfriendly Layout

typedef struct flow_entry {
    wwpn_t source_wwpn;      // 8 bytes
    wwpn_t dest_wwpn;        // 8 bytes
    flow_state_t state;      // 4 bytes
    uint32_t priority;       // 4 bytes
    uint64_t packets;        // 8 bytes
    uint64_t bytes;          // 8 bytes
    timestamp_t created;     // 8 bytes
    timestamp_t last_active; // 8 bytes
    stats_t *statistics;     // 8 bytes pointer
    metadata_t *metadata;    // 8 bytes pointer
    // Total: 72 bytes per entry
} flow_entry_t;

After: Cache-Optimized Layout

I separated hot (frequently accessed) and cold (rarely accessed) data:

// Hot path: fits in single cache line (64 bytes)
typedef struct flow_entry_hot {
    wwpn_t source_wwpn;      // 8 bytes
    wwpn_t dest_wwpn;        // 8 bytes
    uint32_t state_and_flags; // 4 bytes (packed)
    uint32_t cold_data_index; // 4 bytes
    uint64_t packets;        // 8 bytes
    uint64_t bytes;          // 8 bytes
    uint64_t last_active;    // 8 bytes
    uint32_t hash_chain;     // 4 bytes
    uint32_t padding[3];     // 12 bytes (pad out to a full cache line)
    // Total: 64 bytes - perfect cache line fit
} flow_entry_hot_t;

// Cold path: accessed rarely
typedef struct flow_entry_cold {
    timestamp_t created;
    stats_t statistics;
    metadata_t metadata;
    // Separate storage, accessed only when needed
} flow_entry_cold_t;

This hot/cold split meant that common operations (flow lookups, packet counting) touched only 64 bytes per entry instead of 72+, and that 64 bytes fit perfectly in a single CPU cache line.
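
The cold fields live in a parallel array, reached through cold_data_index only when something actually needs them. A sketch of that access path (the cold_entries array and helper names here are illustrative):

// Cold entries live in a parallel array (allocation not shown); hot entries
// reference them by index instead of by pointer.
static flow_entry_cold_t *cold_entries;

// Pull cold data into cache only when a caller actually needs it,
// e.g. for a management or stats query
static inline flow_entry_cold_t* flow_get_cold(const flow_entry_hot_t *hot) {
    return &cold_entries[hot->cold_data_index];
}

// The per-packet hot path touches only the 64-byte hot entry
static inline void flow_count_packet(flow_entry_hot_t *hot,
                                     uint64_t frame_bytes, uint64_t now) {
    hot->packets++;
    hot->bytes += frame_bytes;
    hot->last_active = now;
}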

Results: Orders of Magnitude Improvement

The data structure changes delivered dramatic improvements:

Lookup Performance:

  • Average: 120μs → 1.8μs (67x faster)
  • P99: 340μs → 4.2μs (81x faster)
  • CPU time in lookups: 58% → 4%

Memory Efficiency:

  • 12K flows: 864KB → 768KB (11% reduction)
  • Cache misses: 78% → 12% (6x reduction)
  • TLB pressure: Significantly reduced

System-Wide Impact:

  • Overall throughput: +20%
  • CPU utilization: -35%
  • Latency variance: -62%

Lessons for Data Structure Design

This experience reinforced several critical principles:

1. Big-O Matters in Practice

Theoretical complexity directly translates to real-world performance. O(n) vs O(1) isn’t academic; it’s the difference between success and failure at scale.

2. Cache Locality Is Paramount

Modern CPUs are so fast that memory access is often the bottleneck. Data structures that maximize cache utilization can be 10x faster than algorithmically equivalent alternatives.

3. Measure Real Workloads

Synthetic benchmarks are useful, but test with real customer data. Our WWPN hash function only emerged from analyzing actual deployment patterns.

4. Hot/Cold Separation Pays Off

Most data structures have frequently-accessed and rarely-accessed fields. Separating them improves cache efficiency and can enable other optimizations.

5. Alignment and Padding Matter

That cache line alignment wasn’t accidental. Ensuring structures align to cache line boundaries eliminates false sharing and maximizes cache efficiency.
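
Concretely, that means enforcing the layout at compile time and allocating the entry array on a cache-line boundary. One way to do it, assuming C11 and POSIX:

#include <stdlib.h>

#define CACHE_LINE_SIZE 64

// Catch accidental growth of the hot entry at compile time (C11)
_Static_assert(sizeof(flow_entry_hot_t) == CACHE_LINE_SIZE,
               "flow_entry_hot_t must stay exactly one cache line");

// Allocate the hot-entry array on a cache-line boundary (POSIX)
static flow_entry_hot_t* alloc_hot_entries(uint32_t capacity) {
    void *mem = NULL;
    if (posix_memalign(&mem, CACHE_LINE_SIZE,
                       (size_t)capacity * sizeof(flow_entry_hot_t)) != 0)
        return NULL;
    return mem;
}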

Looking Forward

These data structure optimizations have been so successful that I’m now reviewing other components of our storage networking stack for similar opportunities. The patterns are broadly applicable:

  • Flow tables in switches and routers
  • Connection tracking in firewalls
  • Session management in load balancers
  • Index structures in databases

As we continue scaling FC-Redirect throughout 2013, these foundations ensure we can handle whatever demands customers throw at us. But more importantly, they’ve reminded me that sometimes the best optimizations aren’t clever algorithms or fancy techniques; they’re getting the fundamentals right.

Choose your data structures wisely. Everything else builds on that foundation.