As we kicked off 2013, I faced one of the most challenging projects in my career at Cisco: scaling our FC-Redirect solution from 1,000 to 12,000 concurrent flows. This 12x increase wasn’t just about handling more data; it required fundamentally rethinking how we approached distributed state management and flow processing.
The Challenge
FC-Redirect is a critical component in our storage networking infrastructure, enabling intelligent traffic redirection across Fibre Channel fabrics. When customers started requesting support for massively scaled environments, we quickly realized our existing architecture wouldn’t cut it. The initial implementation worked beautifully for smaller deployments, but as we pushed beyond 2,000 flows in our test labs, we started seeing performance degradation and occasional flow tracking inconsistencies.
Identifying the Bottlenecks
The first step was understanding where our bottlenecks lived. I spent several weeks instrumenting the codebase with detailed performance counters and analyzing flow patterns. What I discovered was eye-opening:
- Linear lookups: Our flow tracking used a simple linked list structure. At 1,000 flows, lookup times were negligible. At 10,000 flows, we were spending 40% of our CPU cycles just finding the right flow entry.
- Synchronous state updates: Every flow state change triggered immediate synchronization across our distributed nodes. This created a cascade of network traffic that scaled quadratically with the number of flows.
- Single-threaded processing: Our main event loop processed flow updates sequentially, creating a significant bottleneck under high load.
The Solution: Multi-Pronged Optimization
The scaling solution required changes across multiple layers of the architecture.
Hash Table Implementation
The most impactful change was replacing our linked list flow tracking with a custom hash table implementation. I designed a hash function that took advantage of the specific properties of Fibre Channel WWPNs (World Wide Port Names), achieving excellent distribution with minimal collisions.
// Previous approach: O(n) lookup
flow_entry_t* find_flow(flow_list_t *list, wwpn_t key) {
    flow_entry_t *entry;
    for (entry = list->head; entry != NULL; entry = entry->next) {
        if (entry->wwpn == key) return entry;
    }
    return NULL;
}

// New approach: O(1) average case
flow_entry_t* find_flow_optimized(flow_table_t *table, wwpn_t key) {
    uint32_t hash = wwpn_hash(key);
    uint32_t index = hash & table->mask;   // bucket count is a power of two
    flow_entry_t *entry;

    // Collisions chain within a bucket; chains stay short with a good hash.
    for (entry = table->buckets[index]; entry != NULL; entry = entry->next) {
        if (entry->wwpn == key) return entry;
    }
    return NULL;
}
This single change reduced our flow lookup times from an average of 120 microseconds to under 2 microseconds at 12K flows.
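The hash function itself isn't shown above, and the production version is tied to how WWPNs are allocated in real fabrics. As a rough illustration, though, a simple multiplicative mix over the 64-bit WWPN along these lines would give good bucket distribution; the wwpn_t typedef and the mixing constant here are assumptions for the sketch, not the function we shipped:

// Illustrative sketch only: a generic mixing hash over a 64-bit WWPN.
// The wwpn_t typedef and the constant are assumptions, not the
// production function, which exploited WWPN structure more directly.
#include <stdint.h>

typedef uint64_t wwpn_t;

static inline uint32_t wwpn_hash(wwpn_t key) {
    // Fold the vendor (high) and port (low) halves together so ports
    // sharing a vendor OUI still spread across buckets.
    key ^= key >> 33;
    key *= 0xff51afd7ed558ccdULL;   // odd 64-bit constant (MurmurHash3 finalizer)
    key ^= key >> 33;
    return (uint32_t)key;
}

Masking the result with table->mask (a power-of-two bucket count minus one) then yields the bucket index used in find_flow_optimized.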
Batched State Synchronization
Instead of synchronizing every state change immediately, I implemented a batching mechanism that aggregates changes over a 50-millisecond window. This reduced our synchronization traffic by over 80% while maintaining acceptable consistency guarantees.
The key insight was that most flow state changes happen in bursts, and many intermediate states are never observed by the system. By batching, we only propagate the final state of each flow within the window.
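A rough sketch of the batching idea, assuming a simple dirty list flushed from a periodic timer: a flow is queued at most once per window, so intermediate states are overwritten in place and only the final state gets sent. The types and the sync_send_flow_state() call below are placeholders, not our actual interfaces.

// Illustrative sketch: mark flows dirty as they change, then flush the
// dirty list from a 50 ms timer so only the latest state is propagated.
// Types and sync_send_flow_state() are hypothetical stand-ins.
#define SYNC_WINDOW_MS 50

typedef struct flow_entry flow_entry_t;
struct flow_entry {
    int           state;
    int           dirty;        // already queued for the next sync?
    flow_entry_t *dirty_next;   // singly linked dirty list
};

void sync_send_flow_state(flow_entry_t *flow);   // provided elsewhere (hypothetical)

static flow_entry_t *dirty_head;

void flow_set_state(flow_entry_t *flow, int new_state) {
    flow->state = new_state;
    if (!flow->dirty) {         // intermediate states just overwrite in place
        flow->dirty = 1;
        flow->dirty_next = dirty_head;
        dirty_head = flow;
    }
}

// Invoked from a SYNC_WINDOW_MS periodic timer.
void flow_sync_flush(void) {
    flow_entry_t *flow = dirty_head;
    dirty_head = NULL;
    while (flow != NULL) {
        flow_entry_t *next = flow->dirty_next;
        flow->dirty = 0;
        sync_send_flow_state(flow);   // hypothetical peer-sync call
        flow = next;
    }
}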
Event Loop Redesign
I restructured our main processing loop to leverage multiple threads for independent operations. Flow updates that don’t share state can now be processed in parallel across CPU cores. On our test MDS 9250i platform with 8 cores, this alone improved throughput by 3.5x.
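The shape of that change, roughly: hash each update onto a fixed worker thread, so updates for the same flow stay ordered on one core while unrelated flows run in parallel. The queue type and helper functions below are hypothetical stand-ins for the real scheduler, not our production code.

// Illustrative sketch: same flow -> same worker, so per-flow ordering is
// preserved without locks while unrelated flows spread across cores.
// work_queue_t, work_queue_push/pop, and process_flow_update() are
// hypothetical stand-ins.
#include <pthread.h>
#include <stdint.h>

#define NUM_WORKERS 8

typedef struct {
    uint64_t wwpn;        // flow key
    int      new_state;
} flow_update_t;

typedef struct {
    pthread_t    thread;
    work_queue_t queue;   // hypothetical blocking MPSC queue
} worker_t;

static worker_t workers[NUM_WORKERS];

void dispatch_flow_update(flow_update_t *upd) {
    // Two updates to the same flow can never race: they land on one worker.
    uint32_t w = wwpn_hash(upd->wwpn) % NUM_WORKERS;
    work_queue_push(&workers[w].queue, upd);
}

static void *worker_main(void *arg) {
    worker_t *self = arg;
    for (;;) {
        flow_update_t *upd = work_queue_pop(&self->queue);  // blocks when idle
        process_flow_update(upd);   // the existing single-flow handler
    }
    return NULL;
}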
Performance Results
After three months of development and testing, the results exceeded our targets:
- Flow capacity: Successfully tested with 12,000 concurrent flows
- Lookup performance: 98% reduction in average lookup time
- Throughput: 20% improvement in overall packet processing
- CPU utilization: 35% reduction under peak load
- Memory footprint: Only 15% increase despite 12x flow capacity
Lessons Learned
This project reinforced several important principles:
- Measure first, optimize second: Without detailed instrumentation, I would have optimized the wrong things. The hash table change only became obvious after seeing the profiling data.
- Data structures matter: The right data structure can make an orders-of-magnitude difference in performance. Time spent on algorithmic design pays for itself many times over.
- Batch where possible: In distributed systems, batching and amortization are your friends. Don't pay the synchronization cost for every tiny change.
- Test at scale early: Problems that don't appear at 1,000 flows become catastrophic at 10,000. Always test beyond your target scale.
Looking Forward
This scaling work has opened up new possibilities for our customers running massive storage infrastructures. We’re already seeing deployments leveraging this capability to consolidate their Fibre Channel fabrics and reduce complexity.
The techniques we developed here are broadly applicable to any distributed system dealing with large state tables. I’m excited to apply these lessons to other components of our storage networking stack throughout 2013.
As systems continue to grow in scale, the ability to handle 10x or 100x increases gracefully becomes table stakes. The challenge isn’t just making things work at scale; it’s maintaining performance and efficiency as you scale. That’s where the real engineering work happens.