Over the past quarter, I’ve been leading the effort to migrate FC-Redirect to Cisco’s new MDS 9250i platform. This multiservice platform combines Fibre Channel, FCoE, and Ethernet in a single chassis, representing the future of our storage networking line. The migration has been both challenging and enlightening.

Understanding the MDS 9250i Architecture

The MDS 9250i is fundamentally different from our previous platforms. Instead of dedicated Fibre Channel ASICs, it uses a hybrid architecture:

  • Merchant silicon for packet processing (Broadcom Trident)
  • Custom NPU for Fibre Channel-specific functions
  • Multi-core x86 for control plane (8-core Intel)
  • Unified memory architecture shared across subsystems

This architecture offers flexibility and convergence, but it also introduces new constraints and opportunities. The most significant difference is that packet processing now happens in software on general-purpose cores, rather than being fully offloaded to ASICs.

The Performance Challenge

My first concern was performance. Our previous ASIC-based implementation could process frames at line rate with sub-microsecond latency. Could a software-based approach match that?

Initial benchmarks were sobering:

  • ASIC implementation: 2.4 million frames/sec, 0.8μs latency
  • Initial x86 port: 800K frames/sec, 12μs latency

We had a 3x throughput gap and 15x latency gap to close. This required rethinking our entire packet processing pipeline.

Leveraging Platform Capabilities

Rather than trying to directly port the ASIC approach, I analyzed what the x86 architecture did better:

CPU-Specific Optimizations

Modern x86 processors have powerful SIMD instructions. I rewrote our flow lookup to use AVX2 for parallel comparisons:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Compare 4 WWPNs simultaneously using AVX2. 'table' holds 64-bit WWPN keys
// and must be 32-byte aligned for the vector load.
static uint64_t *lookup_batch(uint64_t *table, size_t index, uint64_t search_wwpn) {
    __m256i target     = _mm256_set1_epi64x(search_wwpn);
    __m256i candidates = _mm256_load_si256((__m256i *)&table[index]);
    __m256i matches    = _mm256_cmpeq_epi64(target, candidates);
    int mask           = _mm256_movemask_epi8(matches);

    if (mask != 0) {
        // Found a match in this batch of 4; movemask yields 8 bits per 64-bit lane
        int position = __builtin_ctz(mask) / 8;
        return &table[index + position];
    }
    return NULL;  // No match in this batch
}

This SIMD approach let us check 4 potential flow matches in parallel, effectively quadrupling our lookup throughput.
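
As a usage sketch, assuming the key table is padded to a multiple of four 64-bit entries (a plain linear scan here, purely for illustration), the batched helper above is simply applied in strides of four:

// Scan the flow-key table four entries at a time using the AVX2 helper above.
static uint64_t *find_flow_key(uint64_t *table, size_t num_entries, uint64_t wwpn) {
    for (size_t i = 0; i < num_entries; i += 4) {
        uint64_t *hit = lookup_batch(table, i, wwpn);
        if (hit != NULL)
            return hit;
    }
    return NULL;  // WWPN not present in the table
}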

Cache Hierarchy Awareness

The x86 processor has a three-level cache hierarchy (L1: 32KB, L2: 256KB, L3: 12MB). I restructured our data layout to maximize cache utilization:

  1. Hot data in L1: Flow keys and basic state fit in 64 bytes
  2. Warm data in L2: Statistics and counters
  3. Cold data in L3: Full flow details and metadata

By keeping the lookup path touching only L1-cached data, we achieved average lookup latency under 10 nanoseconds.
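
To make that split concrete, here is a rough sketch of the layout (field names and sizes are illustrative, not the shipping structures): the hot fields are packed into a single 64-byte cache line, and everything else hangs off a pointer that the lookup path never follows.

#include <stdint.h>

struct flow_cold;   // Full flow details and metadata, defined below

// Hot data (illustrative layout): everything the fast-path lookup touches
// fits in a single 64-byte cache line.
struct flow_hot {
    uint64_t wwpn_key;          // Flow key compared on every lookup
    uint64_t dest_wwpn;         // Redirect target
    uint32_t state;             // Basic flow state
    uint32_t flags;
    struct flow_cold *cold;     // Colder data lives behind a pointer
    uint8_t  pad[32];           // Pad the struct to exactly one cache line
} __attribute__((aligned(64)));

// Warm/cold data: statistics, counters, and full metadata touched off the fast path.
struct flow_cold {
    uint64_t frames_redirected;
    uint64_t bytes_redirected;
    uint64_t last_seen_tsc;
    // ... remaining flow details and metadata ...
};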

Huge Pages

Standard 4KB pages created significant TLB pressure with our 12K-entry flow table. Switching to 2MB huge pages effectively eliminated TLB misses:

#include <sys/mman.h>

// Allocate the flow table with 2MB huge pages so it spans only a handful of TLB entries
void *flow_table = mmap(NULL, table_size,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                        -1, 0);

if (flow_table == MAP_FAILED) {
    // Fall back to normal 4KB pages if no huge pages are reserved
    flow_table = mmap(NULL, table_size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS,
                      -1, 0);
}

TLB misses dropped from 15% of memory accesses to effectively zero, improving overall throughput by 8%.

Multi-Core Scaling

The 8-core architecture enabled parallelism impossible on our ASIC platform. I implemented a sophisticated work partitioning system:

Flow Affinity

Each flow is assigned to a specific core based on its WWPN hash. All packets for that flow are processed on the same core, eliminating cross-core synchronization:

static inline uint32_t flow_to_core(wwpn_t wwpn) {
    uint32_t hash = wwpn_hash(wwpn);
    return hash % num_processing_cores;
}

void process_packet(fc_frame_t *frame) {
    wwpn_t flow_key = extract_flow_key(frame);
    uint32_t core = flow_to_core(flow_key);

    enqueue_to_core(core, frame);
}

This design scaled nearly linearly across cores since each core operates independently on its subset of flows.

Lock-Free Communication

Inter-core communication used lock-free ring buffers for maximum efficiency:

#include <stdatomic.h>
#include <stdint.h>

typedef struct core_queue {
    _Atomic uint64_t head;                  // Consumer index
    char padding1[64 - sizeof(uint64_t)];   // Keep head on its own cache line
    _Atomic uint64_t tail;                  // Producer index
    char padding2[64 - sizeof(uint64_t)];   // Keep tail on its own cache line

    fc_frame_t *frames[QUEUE_SIZE];         // Frame slots
} core_queue_t __attribute__((aligned(64)));

With head and tail padded onto separate cache lines, the producing core and the consuming core never invalidate each other’s cached copy of the indices, eliminating false sharing and maximizing throughput.
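
A minimal sketch of the ring operations, assuming one queue per producer/consumer core pair (so each ring has exactly one writer and one reader) and that QUEUE_SIZE is a power of two:

#include <stdbool.h>

// Producer side: runs on the core that received the frame.
// Returns false if the target core's queue is full.
static bool ring_enqueue(core_queue_t *q, fc_frame_t *frame) {
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint64_t head = atomic_load_explicit(&q->head, memory_order_acquire);

    if (tail - head == QUEUE_SIZE)
        return false;                           // Full: caller must drop or retry

    q->frames[tail & (QUEUE_SIZE - 1)] = frame;
    // Release: the frame pointer must be visible before the new tail
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

// Consumer side: runs on the core that owns the flow.
static fc_frame_t *ring_dequeue(core_queue_t *q) {
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail)
        return NULL;                            // Empty

    fc_frame_t *frame = q->frames[head & (QUEUE_SIZE - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return frame;
}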

Backward Compatibility Challenges

FC-Redirect needed to work seamlessly across our entire platform family: legacy MDS switches, N7000 with FC modules, and the new 9250i. This created interesting challenges:

Protocol Version Negotiation

Different platforms supported different feature sets. I implemented capability negotiation:

#include <stdint.h>

#define min(a, b) ((a) < (b) ? (a) : (b))

typedef struct platform_caps {
    uint32_t max_flows;
    uint32_t protocol_version;
    uint32_t features;           // Bitmap of supported features
    uint32_t compression_types;  // Bitmap of supported compression types
} platform_caps_t;

void negotiate_protocol(peer_node_t *peer) {
    platform_caps_t local_caps = get_local_capabilities();
    platform_caps_t peer_caps = exchange_capabilities(peer);

    // Use the lowest common denominator of the two platforms
    peer->max_flows = min(local_caps.max_flows, peer_caps.max_flows);
    peer->protocol_version = min(local_caps.protocol_version,
                                 peer_caps.protocol_version);
    peer->features = local_caps.features & peer_caps.features;
}

This ensured graceful degradation when mixing platforms while allowing new platforms to use advanced features when talking to each other.
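
As a usage sketch (FEATURE_COMPRESSION is an illustrative bit here, not the real protocol assignment), the data path then only has to test the negotiated bitmap before using an optional capability:

#include <stdbool.h>

// Hypothetical feature bit for illustration; real bit assignments live in the
// shared protocol header.
#define FEATURE_COMPRESSION (1u << 3)

static bool peer_supports_compression(const peer_node_t *peer) {
    // Set only if *both* sides advertised it during negotiation
    return (peer->features & FEATURE_COMPRESSION) != 0;
}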

Performance Parity

Customers expected consistent performance regardless of platform mix. I implemented adaptive flow distribution that accounted for platform capabilities:

  • Newer platforms handle more flows
  • Latency-sensitive flows routed to fastest platforms
  • Load balancing accounts for per-platform capacity

This was critical for smooth migration; customers could introduce new hardware gradually without performance cliffs.
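
A condensed sketch of how that weighting might look (the structure, helper name, and capacity scores are illustrative, not the production logic): each node advertises a relative capacity score, new flows land on the node with the most headroom, and latency-sensitive flows are restricted to the fastest platform class.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative per-node view used by the placement logic.
typedef struct node_state {
    uint32_t capacity_weight;   // Relative capacity score advertised by the platform
    uint32_t active_flows;      // Flows currently assigned to this node
    bool     low_latency_class; // True for the fastest platform generation
} node_state_t;

// Pick the node with the most remaining headroom; latency-sensitive flows are
// limited to the low-latency class of platforms. Returns -1 if no node is eligible.
static int pick_node_for_flow(node_state_t *nodes, size_t num_nodes, bool latency_sensitive) {
    int best = -1;
    double best_headroom = 0.0;

    for (size_t i = 0; i < num_nodes; i++) {
        if (nodes[i].capacity_weight == 0)
            continue;
        if (latency_sensitive && !nodes[i].low_latency_class)
            continue;

        // Headroom: unused fraction of the node's advertised capacity
        double headroom = 1.0 - (double)nodes[i].active_flows /
                                (double)nodes[i].capacity_weight;
        if (best < 0 || headroom > best_headroom) {
            best = (int)i;
            best_headroom = headroom;
        }
    }
    return best;
}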

Power and Thermal Considerations

The x86-based architecture had very different power characteristics than ASICs. Under full load, power consumption was 40% higher. This required careful optimization:

Dynamic Frequency Scaling

When load dropped, I enabled CPU frequency scaling to step the clock down through three tiers and reduce power:

void adjust_cpu_frequency() {
    uint32_t load = get_current_load_percent();

    if (load < 40) {
        set_cpu_frequency(CPU_FREQ_LOW);    // 1.2 GHz
    } else if (load < 70) {
        set_cpu_frequency(CPU_FREQ_MEDIUM); // 1.8 GHz
    } else {
        set_cpu_frequency(CPU_FREQ_HIGH);   // 2.4 GHz
    }
}

This reduced average power consumption by 25% in typical deployments while maintaining full performance when needed.
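
On a Linux-based control plane, set_cpu_frequency can sit on top of the cpufreq sysfs interface. A minimal sketch, assuming the core’s governor has already been set to "userspace" (paths and error handling simplified):

#include <stdint.h>
#include <stdio.h>

// Write a target frequency (in kHz) for one core via cpufreq sysfs.
static int set_core_frequency_khz(unsigned int core, uint32_t freq_khz) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/cpufreq/scaling_setspeed", core);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;                  // cpufreq not available for this core

    fprintf(f, "%u\n", freq_khz);
    fclose(f);
    return 0;
}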

Idle Core Shutdown

When flows are unevenly distributed, some cores might be mostly idle. I implemented selective core parking:

void rebalance_and_park_cores() {
    if (total_load < 50 && num_active_cores > 4) {
        // Migrate flows from least-loaded core
        uint32_t victim_core = find_least_loaded_core();
        migrate_flows_to_other_cores(victim_core);
        park_core(victim_core);
    }
}

In low-load scenarios, this could shut down 4 cores, reducing power by an additional 20%.
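
On Linux, park_core can map onto CPU hotplug: once the flows have been migrated away, the core is taken offline through sysfs. Another simplified sketch, assuming the kernel allows offlining that core:

#include <stdio.h>

// Bring a core online (1) or offline (0) via the CPU hotplug interface.
static int set_core_online(unsigned int core, int online) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/online", core);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;                  // e.g. cpu0 typically cannot be offlined

    fprintf(f, "%d\n", online);
    fclose(f);
    return 0;
}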

Results and Impact

After four months of development and optimization, we achieved impressive results:

Performance:

  • Throughput: 2.1 million frames/sec (87% of ASIC performance)
  • Latency: 2.8μs average (3.5x the ASIC, but acceptable)
  • CPU efficiency: 65% utilization at max throughput
  • Flow capacity: 12,000 flows maintained

Power:

  • Full load: +40% vs ASIC (expected for x86)
  • Typical load: +15% (thanks to dynamic optimization)
  • Idle: -10% (better power gating than ASICs)

Deployment:

  • Backward compatible with all existing platforms
  • Smooth migration path for customers
  • 30% better price/performance ratio

Lessons Learned

This migration reinforced several important principles:

  1. Different platforms require different approaches: Don’t just port code; rearchitect for the target platform’s strengths.

  2. Cache optimization matters more than instruction count: On modern processors, memory access patterns dominate performance.

  3. Multi-core scaling isn’t automatic: Careful work partitioning and synchronization design are essential.

  4. Power is a first-class constraint: In datacenter equipment, power and cooling costs matter as much as performance.

  5. Backward compatibility requires planning: Protocol versioning and capability negotiation can’t be afterthoughts.

Looking Forward

The MDS 9250i migration positions us well for the future of storage networking. The convergence of FC, FCoE, and Ethernet on a unified platform reflects where the industry is headed. The software-centric architecture gives us agility to add features and optimizations that would be impossible with fixed-function ASICs.

As we roll this out to more customers throughout 2013, I’m excited to see how they leverage the new capabilities. The migration was challenging, but it’s opened up possibilities that will benefit our customers for years to come.

Platform migrations are never easy, but they’re opportunities to rethink assumptions and build something better. This one certainly delivered on that promise.