Cisco is preparing to launch the MDS 9700, the next generation of our storage networking platform. I've had early access to the architecture specifications and have been exploring what it means for FC-Redirect. The capabilities are impressive, but they require rethinking some of our fundamental assumptions.
MDS 9700 Architecture Overview
The 9700 represents a significant leap forward:
Hardware Capabilities:
- 768 ports per chassis (4x previous generation)
- 32 Gbps Fibre Channel support
- 96-core processing capability
- DDR4 memory with 2x bandwidth
- PCIe Gen3 internal fabric
- Advanced ASIC with programmable pipeline
Key Differences from Previous Platforms:
MDS 9250i (Current):          MDS 9700 (Next-Gen):
┌───────────────────┐         ┌────────────────────────┐
│ 8-core x86        │         │ 96-core x86            │
│ 48 ports          │         │ 768 ports              │
│ 16 Gbps FC        │         │ 32 Gbps FC             │
│ DDR3 Memory       │         │ DDR4 Memory            │
│ PCIe Gen2         │         │ PCIe Gen3              │
└───────────────────┘         └────────────────────────┘
Scaling factor: 16x ports, 12x cores, 2x speed
The scale increase is massive. FC-Redirect must evolve to leverage this.
Challenge 1: Scaling to 768 Ports
Our current flow tracking architecture assumes ~100 ports max. 768 ports changes everything.
Port-Based Sharding
I'm implementing port-based sharding to distribute load:
#define MDS9700_MAX_PORTS 768
#define FLOWS_PER_PORT_SHARD 1000
typedef struct port_shard {
port_id_t port_id;
flow_table_t *flow_table;
pthread_t processing_thread;
cpu_set_t cpu_affinity;
// Per-port statistics
atomic_uint64_t packets_processed;
atomic_uint64_t flows_active;
} port_shard_t;
typedef struct mds9700_context {
port_shard_t shards[MDS9700_MAX_PORTS];
int num_shards;
// Global flow index for cross-port lookups
global_flow_index_t *global_index;
} mds9700_context_t;
void init_port_sharding(mds9700_context_t *ctx, int num_ports) {
    ctx->num_shards = num_ports;
    for (int i = 0; i < num_ports; i++) {
        port_shard_t *shard = &ctx->shards[i];
        shard->port_id = i;
        shard->flow_table = create_flow_table(FLOWS_PER_PORT_SHARD);
        // Pin each shard to a specific core, spreading load across all 96 cores
        int cpu = i % 96;
        CPU_ZERO(&shard->cpu_affinity);
        CPU_SET(cpu, &shard->cpu_affinity);
        // Apply the affinity at creation time so the thread never starts
        // on the wrong core
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t),
                                    &shard->cpu_affinity);
        pthread_create(&shard->processing_thread, &attr,
                       port_processing_thread, shard);
        pthread_attr_destroy(&attr);
    }
}
This architecture scales linearly with port count.
NUMA-Aware Memory
With 96 cores, NUMA becomes critical:
void allocate_port_shard_numa_aware(port_shard_t *shard, int cpu) {
    // Ask libnuma which node this CPU belongs to instead of hard-coding
    // the topology (the 9700 has 4 NUMA nodes with 24 cores each)
    int numa_node = numa_node_of_cpu(cpu);
    // Prefer that node for the shard's allocations so the processing
    // thread never has to reach across the interconnect
    numa_set_preferred(numa_node);
    shard->flow_table = create_flow_table(FLOWS_PER_PORT_SHARD);
    shard->packet_buffers = allocate_packet_buffers(BUFFER_SIZE);
}
NUMA-aware allocation prevents expensive cross-node memory access.
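To wire this into the sharding setup, the shard allocation would run through this allocator before each processing thread is started, using the same core mapping. A minimal sketch (the helper name is mine):
void init_shard_memory(mds9700_context_t *ctx) {
    for (int i = 0; i < ctx->num_shards; i++) {
        // Same i % 96 mapping used when pinning the processing threads,
        // so each shard's tables live on the node its thread runs on
        allocate_port_shard_numa_aware(&ctx->shards[i], i % 96);
    }
}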
Challenge 2: 32 Gbps Fibre Channel
32 Gbps FC means 2x packet rate compared to 16 Gbps:
Performance Requirements:
- 16 Gbps: ~4M packets/sec
- 32 Gbps: ~8M packets/sec
Our current implementation maxes out at 2.9M packets/sec. We need major optimizations.
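To make the gap concrete, here is a quick back-of-the-envelope check of the per-packet budget. The 2.2 GHz core clock is my own illustrative assumption, not a 9700 spec:
#include <stdio.h>

int main(void) {
    const double target_pps = 8e6;    // packet rate at 32 Gbps (from above)
    const double core_hz    = 2.2e9;  // assumed core clock, illustrative only
    double budget_ns = 1e9 / target_pps;      // wall-clock budget per packet
    double cycles    = core_hz / target_pps;  // cycles available on one core
    printf("Per-packet budget: %.0f ns (~%.0f cycles on one core)\n",
           budget_ns, cycles);
    return 0;
}
That works out to roughly 125 ns per packet on a single core, which is why sharding across cores and pushing the fast path into hardware (next) both matter.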
Hardware Offload
The 9700's ASIC supports programmable packet processing:
// Program ASIC for fast-path flow processing
typedef struct asic_flow_entry {
fc_id_t src_fcid;
fc_id_t dst_fcid;
// Action
enum {
ASIC_ACTION_FORWARD,
ASIC_ACTION_REDIRECT,
ASIC_ACTION_DROP,
ASIC_ACTION_MIRROR
} action;
port_id_t output_port;
uint8_t qos_class;
// Counters (updated in hardware)
uint64_t packet_count;
uint64_t byte_count;
} asic_flow_entry_t;
bool program_asic_flow(asic_interface_t *asic,
flow_entry_t *flow) {
asic_flow_entry_t hw_entry = {
.src_fcid = flow->src_fcid,
.dst_fcid = flow->dst_fcid,
.action = translate_action(flow->policy.action),
.output_port = flow->policy.output_port,
.qos_class = flow->policy.qos_class
};
// Install in ASIC flow table (TCAM)
return asic_install_flow_entry(asic, &hw_entry);
}
Hardware processing handles the fast path at line rate. Software handles exceptions.
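The split works roughly like this: a packet that misses the ASIC flow table is punted to software, which resolves the policy, handles that packet, and installs a hardware entry so the rest of the flow stays on the fast path. A sketch of that exception path; the punt hook and the lookup_or_create_flow/log_warning helpers are my own names, not the driver API:
void handle_asic_miss(asic_interface_t *asic, packet_t *pkt) {
    // Resolve (or create) the software flow entry for this packet
    flow_entry_t *flow = lookup_or_create_flow(pkt->flow_key);
    // Handle this one packet in software
    apply_flow_policy(pkt, flow);
    forward_packet(pkt);
    // Offload the flow so subsequent packets never leave the ASIC
    if (!program_asic_flow(asic, flow)) {
        // TCAM full: the flow simply stays on the software path
        log_warning("ASIC flow table full, keeping flow in software");
    }
}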
Batch Processing
Process packets in batches for better cache utilization:
#define BATCH_SIZE 64
typedef struct packet_batch {
packet_t *packets[BATCH_SIZE];
int count;
} packet_batch_t;
void process_packet_batch(packet_batch_t *batch) {
// Prefetch flow entries
for (int i = 0; i < batch->count; i++) {
wwpn_t key = batch->packets[i]->flow_key;
flow_entry_t *flow = lookup_flow_prefetch(key);
__builtin_prefetch(flow, 0, 3); // Prefetch for read
}
// Process all packets in batch
for (int i = 0; i < batch->count; i++) {
packet_t *pkt = batch->packets[i];
flow_entry_t *flow = lookup_flow(pkt->flow_key);
// Flow already in cache from prefetch
apply_flow_policy(pkt, flow);
forward_packet(pkt);
}
}
Batching + prefetching reduces cache misses significantly.
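For completeness, the receive loop that fills those batches looks something like this; rx_queue_poll() is a placeholder for whatever the 9700 driver actually exposes:
void rx_loop(rx_queue_t *rxq) {
    packet_batch_t batch;
    for (;;) {
        // Pull up to BATCH_SIZE packets off the receive queue in one call
        batch.count = rx_queue_poll(rxq, batch.packets, BATCH_SIZE);
        if (batch.count == 0)
            continue;  // nothing pending; a real loop would back off here
        process_packet_batch(&batch);
    }
}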
Challenge 3: Massive Flow Scale
768 ports × 1000 flows/port = 768K potential concurrent flows:
Hierarchical Flow Tables
Flat hash tables don't scale to 768K entries efficiently. I'm implementing hierarchical tables:
typedef struct hierarchical_flow_table {
// Level 1: Per-port tables (hot data)
flow_table_t *port_tables[MDS9700_MAX_PORTS];
// Level 2: Global overflow table (cold data)
flow_table_t *overflow_table;
// Level 3: Disk-backed archive (historical)
flow_archive_t *archive;
// Statistics
atomic_uint64_t l1_hits;
atomic_uint64_t l2_hits;
atomic_uint64_t l3_hits;
} hierarchical_flow_table_t;
flow_entry_t* hierarchical_lookup(hierarchical_flow_table_t *table,
port_id_t port,
wwpn_t flow_key) {
// L1: Check port-local table (hot path)
flow_entry_t *flow = flow_table_lookup(table->port_tables[port],
flow_key);
if (flow != NULL) {
atomic_fetch_add(&table->l1_hits, 1);
return flow;
}
// L2: Check global overflow
flow = flow_table_lookup(table->overflow_table, flow_key);
if (flow != NULL) {
atomic_fetch_add(&table->l2_hits, 1);
// Promote to L1 if frequently accessed
if (should_promote_to_l1(flow)) {
promote_flow_to_l1(table, port, flow);
}
return flow;
}
// L3: Check archive (rare)
flow = flow_archive_lookup(table->archive, flow_key);
if (flow != NULL) {
atomic_fetch_add(&table->l3_hits, 1);
// Restore to L2
flow_table_insert(table->overflow_table, flow);
return flow;
}
return NULL;
}
This hierarchy keeps hot data in fast tables and cold data in slower storage.
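The lookup path above handles promotion; demotion is the mirror image. A background sweep along these lines would push idle flows from a port's L1 table down into the overflow table (the iterator helpers, last_active field, and idle threshold are my own placeholders):
void demote_idle_flows(hierarchical_flow_table_t *table, port_id_t port,
                       timestamp_t now) {
    flow_iter_t it;
    flow_iter_init(&it, table->port_tables[port]);
    flow_entry_t *flow;
    while ((flow = flow_iter_next(&it)) != NULL) {
        if (now - flow->last_active > L1_IDLE_THRESHOLD_SEC) {
            // Cold flow: move it from the per-port L1 table to the global L2
            flow_table_remove(table->port_tables[port], flow->flow_key);
            flow_table_insert(table->overflow_table, flow);
        }
    }
}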
Challenge 4: Advanced Telemetry
The 9700 supports in-band telemetry (INT, In-band Network Telemetry):
Telemetry Integration
typedef struct flow_telemetry {
flow_id_t flow_id;
// Per-hop metrics
struct hop_telemetry {
device_id_t device;
port_id_t ingress_port;
port_id_t egress_port;
timestamp_t timestamp;
uint32_t queue_depth;
uint32_t latency_ns;
} hops[MAX_HOPS];
int num_hops;
} flow_telemetry_t;
void process_telemetry_packet(packet_t *pkt) {
// Extract INT data from packet
flow_telemetry_t telemetry = {0};
extract_int_data(pkt, &telemetry);
// Update flow with telemetry data
flow_entry_t *flow = lookup_flow(telemetry.flow_id);
if (flow != NULL) {
update_flow_telemetry(flow, &telemetry);
// Detect anomalies
if (detect_congestion(&telemetry)) {
trigger_congestion_avoidance(flow);
}
if (detect_asymmetric_routing(&telemetry)) {
log_routing_asymmetry(flow);
}
}
}
INT provides unprecedented visibility into flow behavior.
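As one example of what this enables, detect_congestion() can be as simple as scanning the per-hop records for an outsized queue depth or latency. This is my own sketch with illustrative thresholds, not tuned values:
#define CONGESTION_QUEUE_DEPTH  1024    // entries; illustrative
#define CONGESTION_LATENCY_NS   500000  // 500 microseconds per hop; illustrative

bool detect_congestion(const flow_telemetry_t *t) {
    for (int i = 0; i < t->num_hops; i++) {
        if (t->hops[i].queue_depth > CONGESTION_QUEUE_DEPTH ||
            t->hops[i].latency_ns > CONGESTION_LATENCY_NS) {
            return true;  // any single congested hop flags the flow
        }
    }
    return false;
}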
Challenge 5: Energy Efficiency
768 ports draw significant power. The 9700 supports dynamic power management:
typedef struct port_power_state {
port_id_t port;
enum {
POWER_FULL, // 100% - active traffic
POWER_REDUCED, // 60% - idle but ready
POWER_SUSPEND, // 10% - no traffic expected
POWER_OFF // 0% - administratively down
} state;
timestamp_t last_packet_time;
uint32_t idle_time_sec;
} port_power_state_t;
void manage_port_power(port_power_state_t *port) {
uint32_t idle_time = time_since_last_packet(port);
if (port->state == POWER_FULL && idle_time > 60) {
// No traffic for 60 seconds, reduce power
set_port_power_state(port, POWER_REDUCED);
} else if (port->state == POWER_REDUCED && idle_time > 300) {
// No traffic for 5 minutes, suspend
set_port_power_state(port, POWER_SUSPEND);
} else if (port->state == POWER_SUSPEND && idle_time > 3600) {
// No traffic for 1 hour, consider shutdown
if (can_safely_shutdown_port(port)) {
set_port_power_state(port, POWER_OFF);
}
}
// Wake on packet arrival
if (packet_arrived(port) && port->state != POWER_FULL) {
set_port_power_state(port, POWER_FULL);
}
}
Intelligent power management reduces operating costs significantly.
Testing at Scale
Testing FC-Redirect at 768-port scale requires new approaches:
Simulation Framework
typedef struct port_simulator {
port_id_t port_id;
traffic_generator_t *traffic_gen;
// Simulated load
uint32_t packets_per_second;
uint32_t flows_active;
traffic_pattern_t pattern;
} port_simulator_t;
void simulate_mds9700_load(int num_ports) {
port_simulator_t sims[num_ports];
// Configure realistic load per port
for (int i = 0; i < num_ports; i++) {
sims[i].port_id = i;
sims[i].packets_per_second = 10000; // 10K pps per port
sims[i].flows_active = 100;
sims[i].pattern = PATTERN_BURSTY;
start_port_simulator(&sims[i]);
}
// Total: 768 ports × 10K pps = 7.68M pps
// Monitor system behavior under load
monitor_system_health(3600); // 1 hour test
// Collect results
for (int i = 0; i < num_ports; i++) {
port_stats_t stats = get_port_stats(&sims[i]);
analyze_port_performance(&stats);
}
}
Preliminary Results
Early testing on 9700 prototype hardware shows:
Performance:
- Throughput: 7.2M packets/sec (2.5x improvement)
- Latency: 1.8μs P99 (slight improvement)
- Flow capacity: 500K concurrent flows (42x improvement)
Efficiency:
- CPU utilization: 65% at max load (good headroom)
- Memory: 48GB for 500K flows (0.1MB per flow)
- Power: 2800W full load, 1400W typical (50% savings vs always-on)
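To put those power numbers in context: the 1400W saved at typical load works out to roughly 1.4 kW × 8760 h ≈ 12,300 kWh per chassis per year, on the order of $1,200 annually at an assumed $0.10/kWh (my estimate, not a measured figure).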
Lessons for Next-Gen Platform Support
This work has taught me:
1. Scale Changes Everything
Techniques that work at small scale fail at large scale. Hierarchical approaches become essential.
2. Hardware Offload Is Critical
At 8M pps, software can't keep up. Hardware acceleration is necessary for line-rate processing.
3. NUMA Awareness Is Mandatory
With many cores and NUMA architecture, memory locality determines performance.
4. Power Management Matters
At datacenter scale, power costs rival hardware costs. Efficiency is a first-class requirement.
5. Testing at Scale Is Hard
Simulating 768-port load requires specialized tooling and infrastructure.
Looking Forward
The MDS 9700 launches later this year. FC-Redirect will be ready to leverage its capabilities:
- 10x scale increase
- 2.5x performance improvement
- Advanced telemetry integration
- Significant power efficiency
This platform will serve customers for the next decade. The architectural work we're doing now ensures FC-Redirect can grow with customer demands.
Next-generation platforms create opportunities to rethink assumptions and build better systems. The 9700 is such an opportunity.
As storage networks continue growing in scale and sophistication, platforms like the 9700 provide the foundation for meeting customer needs. Iβm excited to see what customers build on this platform.
The future of storage networking is more scale, more speed, and more intelligence. The MDS 9700 delivers all three.