The Nexus 7000 represents a different approach to data center networking compared to dedicated storage switches. As a modular platform supporting Ethernet, Fibre Channel, and FCoE, it requires careful optimization to deliver FC-Redirect performance competitive with purpose-built storage switches.
Understanding the N7000 Architecture
The N7000's architecture is fundamentally different from that of the MDS family:
Hardware Architecture
N7000 Components:
┌─────────────────────────────────────┐
│          Supervisor Module          │
│     (Control Plane - x86 Linux)     │
└─────────────────┬───────────────────┘
                  │
        ┌─────────┼──────────┐
        │         │          │
   ┌────▼───┐ ┌───▼────┐ ┌───▼────┐   ...
   │ Line   │ │ Line   │ │ Line   │
   │ Card 1 │ │ Card 2 │ │ Card 3 │
   │(F3/M3) │ │(F3/M3) │ │(F3/M3) │
   └────────┘ └────────┘ └────────┘
Key characteristics:
- Distributed forwarding: Each line card has local ASICs and CPU
- Centralized control: Supervisor runs control plane software
- Mixed traffic: FC, FCoE, and Ethernet on same chassis
- Modular design: Line cards can be added/removed
This distributed architecture creates both challenges and opportunities.
The Performance Challenge
The initial FC-Redirect port to the N7000 showed disappointing results:
MDS 9250i (baseline):
- Throughput: 2.9M packets/sec
- Latency: 2.0μs P99
- CPU utilization: 45%
N7000 (initial port):
- Throughput: 1.8M packets/sec (38% slower)
- Latency: 4.5μs P99 (2.25x worse)
- CPU utilization: 78%
We had significant optimization work ahead.
Optimization 1: Distributed Flow Processing
The N7000's distributed architecture led me to rethink our centralized flow processing model.
Original Architecture (Centralized)
// All flow processing on supervisor
void process_packet_centralized(packet_t *pkt) {
    // Packet arrives at line card
    // Line card sends to supervisor
    // Supervisor looks up flow
    // Supervisor sends decision back to line card
    // Line card forwards packet
    // Result: 2 inter-module communications per packet
}
This incurred massive overhead: packets crossed the backplane twice per operation.
Optimized Architecture (Distributed)
// Flow processing distributed to line cards
typedef struct linecard_flow_cache {
    flow_entry_t *hot_flows;    // Recently used flows
    uint32_t cache_size;
    cache_stats_t stats;
} linecard_flow_cache_t;

void process_packet_distributed(packet_t *pkt, linecard_flow_cache_t *cache) {
    // Check line card local cache
    flow_entry_t *flow = lookup_local_cache(cache, pkt->flow_key);
    if (flow != NULL) {
        // Cache hit - process locally, no backplane crossing
        apply_flow_policy(pkt, flow);
        forward_packet_locally(pkt);
        cache->stats.hits++;
        return;
    }

    // Cache miss - query supervisor
    flow = query_supervisor_flow(pkt->flow_key);
    if (flow != NULL) {
        // Add to local cache
        insert_local_cache(cache, flow);
        apply_flow_policy(pkt, flow);
        forward_packet_locally(pkt);
        cache->stats.misses++;
    } else {
        // Unknown flow - send to supervisor
        send_to_supervisor(pkt);
        cache->stats.unknowns++;
    }
}
This caching approach leverages temporal locality. If we see a flow once, we'll likely see it again soon. By caching flow policies locally on line cards, we eliminate backplane traffic for cached flows.
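The processing path above leans on lookup_local_cache and insert_local_cache helpers that aren't shown. A minimal sketch of what they might look like over a direct-mapped slot array follows; the hash function, entry layout, and simplified signatures are assumptions for illustration, not the production code.

#include <stdint.h>
#include <stdbool.h>

#define LC_CACHE_SLOTS 2048              // mirrors LINECARD_CACHE_SIZE below

typedef struct {
    uint64_t flow_key;                   // packed flow identifier (illustrative)
    uint32_t policy_id;                  // handle into the local policy table
    uint64_t last_access;                // recency counter for LRU decisions
    bool     valid;
} lc_slot_t;

typedef struct {
    lc_slot_t slots[LC_CACHE_SLOTS];
    uint64_t  tick;                      // monotonically increasing access counter
} lc_cache_t;

static uint32_t slot_for_key(uint64_t key) {
    // Multiplicative hash into the slot array; the real hash is a platform detail.
    return (uint32_t)((key * 0x9E3779B97F4A7C15ULL) >> 53);
}

lc_slot_t *lookup_local_cache(lc_cache_t *cache, uint64_t key) {
    lc_slot_t *slot = &cache->slots[slot_for_key(key)];
    if (slot->valid && slot->flow_key == key) {
        slot->last_access = ++cache->tick;   // refresh recency on a hit
        return slot;
    }
    return NULL;                             // miss: caller queries the supervisor
}

void insert_local_cache(lc_cache_t *cache, uint64_t key, uint32_t policy_id) {
    lc_slot_t *slot = &cache->slots[slot_for_key(key)];
    slot->flow_key    = key;                 // direct-mapped: a colliding entry is overwritten
    slot->policy_id   = policy_id;
    slot->last_access = ++cache->tick;
    slot->valid       = true;
}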
Cache Management
The cache needs intelligent management:
#define LINECARD_CACHE_SIZE 2048    // Flows per line card

typedef struct flow_cache_entry {
    wwpn_t flow_key;
    flow_policy_t policy;
    timestamp_t last_access;
    uint64_t packet_count;
    bool valid;
} flow_cache_entry_t;

// LRU eviction when cache is full
void evict_lru_entry(linecard_flow_cache_t *cache) {
    timestamp_t oldest = UINT64_MAX;
    uint32_t victim_index = 0;

    for (int i = 0; i < cache->cache_size; i++) {
        if (cache->entries[i].last_access < oldest) {
            oldest = cache->entries[i].last_access;
            victim_index = i;
        }
    }

    // Invalidate victim
    cache->entries[victim_index].valid = false;
    cache->stats.evictions++;
}

// Proactive cache warming
void warm_cache_from_supervisor(linecard_flow_cache_t *cache) {
    // Get top N most active flows from supervisor
    flow_list_t *hot_flows = get_hot_flows_from_supervisor(cache->cache_size);

    for (int i = 0; i < hot_flows->count; i++) {
        insert_local_cache(cache, &hot_flows->flows[i]);
    }
    cache->stats.warmings++;
}
Cache warming at startup dramatically improves hit rates during the critical first minutes after line card insertion or reboot.
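To tell whether warming (and caching in general) is paying off, the hit rate reported in the results later falls straight out of the counters the cache already maintains. A small sketch, assuming cache_stats_t carries the hits/misses/unknowns counters incremented in the processing path:

#include <stdint.h>

typedef struct {
    uint64_t hits;
    uint64_t misses;
    uint64_t unknowns;
    uint64_t evictions;
    uint64_t warmings;
} cache_stats_t;               // field set assumed from the counters used above

double cache_hit_rate(const cache_stats_t *s) {
    uint64_t lookups = s->hits + s->misses + s->unknowns;
    return lookups ? (double)s->hits / (double)lookups : 0.0;
}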
Optimization 2: Backplane Traffic Minimization
Communication between line cards and the supervisor is expensive. I implemented several techniques to minimize it:
Batched Updates
typedef struct update_batch {
    flow_update_t updates[BATCH_SIZE];
    uint32_t count;
    timestamp_t batch_start_time;
} update_batch_t;

void send_statistics_update(linecard_context_t *lc, flow_update_t *update) {
    // Add to batch
    lc->update_batch.updates[lc->update_batch.count++] = *update;

    timestamp_t now = get_time_ms();
    uint32_t batch_age = now - lc->update_batch.batch_start_time;

    // Flush if batch is full or old
    if (lc->update_batch.count >= BATCH_SIZE || batch_age > 100) {
        // Send entire batch in one message
        send_batch_to_supervisor(&lc->update_batch);

        // Reset batch
        lc->update_batch.count = 0;
        lc->update_batch.batch_start_time = now;
    }
}
Batching reduced the supervisor message rate from 200K/sec to 15K/sec, a 93% reduction.
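One caveat with flushing only from the data path: if traffic pauses, a partially filled batch can sit until the next packet arrives. A hedged sketch of a timer-driven flush that closes that gap; the callback name and timer wiring are assumptions rather than the shipping code.

// Called periodically (e.g., every 100 ms) from a line-card housekeeping timer.
void flush_stale_batch(linecard_context_t *lc) {
    if (lc->update_batch.count == 0) {
        return;                                    // nothing pending to send
    }
    send_batch_to_supervisor(&lc->update_batch);   // same send path as a full batch
    lc->update_batch.count = 0;
    lc->update_batch.batch_start_time = get_time_ms();
}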
Compression
// Compress flow updates before sending
void send_compressed_batch(update_batch_t *batch) {
    // Delta encoding: send only changes
    compressed_batch_t compressed = {0};
    for (int i = 0; i < batch->count; i++) {
        flow_update_t *update = &batch->updates[i];
        // Only include changed fields
        if (update->packets_changed) {
            add_packet_delta(&compressed, update->flow_key,
                             update->packet_delta);
        }
        if (update->bytes_changed) {
            add_byte_delta(&compressed, update->flow_key,
                           update->byte_delta);
        }
    }

    // Compress the delta-encoded batch with LZ4 (fast, decent ratio)
    char output_buffer[MAX_COMPRESSED_SIZE];
    int original_size = (int)sizeof(compressed);
    int compressed_size = LZ4_compress_default(
        (const char *)&compressed, output_buffer,
        original_size, (int)sizeof(output_buffer));
    if (compressed_size <= 0) {
        // Compression failed (output buffer too small): send uncompressed
        send_to_supervisor(&compressed, original_size);
        return;
    }

    // Send compressed data
    send_to_supervisor(output_buffer, compressed_size);

    // Track compression ratio
    float ratio = (float)original_size / (float)compressed_size;
    update_compression_stats(ratio);
}
Compression reduced bandwidth by 65% on typical workloads.
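On the receiving side, the supervisor reverses the process: decompress with LZ4_decompress_safe, then fold the deltas into its flow state. A minimal sketch under assumed names (handle_compressed_batch, apply_flow_deltas, and the error counter are illustrative, not the actual supervisor code):

#include <lz4.h>

void handle_compressed_batch(const char *wire_data, int wire_len) {
    compressed_batch_t decoded;
    // LZ4_decompress_safe returns the decompressed size, or a negative value on error.
    int n = LZ4_decompress_safe(wire_data, (char *)&decoded,
                                wire_len, (int)sizeof(decoded));
    if (n < 0) {
        // Corrupt or truncated message: count it and drop rather than crash
        // (bad_batches is a hypothetical counter for this sketch).
        supervisor_stats.bad_batches++;
        return;
    }
    apply_flow_deltas(&decoded);   // fold packet/byte deltas into supervisor flow state
}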
Optimization 3: Platform-Specific Features
The N7000 has unique hardware features we can leverage:
Hardware Flow Acceleration
The F3 line cards have hardware flow recognition:
// Program hardware to recognize and cache flows
void program_hardware_flow(linecard_context_t *lc, flow_entry_t *flow) {
    // Hardware flow entry
    hw_flow_entry_t hw_entry = {
        .src_fcid = flow->src_fcid,
        .dst_fcid = flow->dst_fcid,
        .action = flow->policy.action,
        .output_port = flow->policy.output_port,
        .qos_class = flow->policy.qos_class,
        // Hardware statistics
        .hw_packet_counter = 0,
        .hw_byte_counter = 0
    };

    // Install in hardware TCAM
    install_hw_flow_entry(lc->hw_table, &hw_entry);
}

// Periodically sync hardware counters
void sync_hardware_counters(linecard_context_t *lc) {
    for (int i = 0; i < lc->hw_table->num_entries; i++) {
        hw_flow_entry_t *hw = &lc->hw_table->entries[i];
        if (hw->valid) {
            // Read hardware counters
            uint64_t hw_packets = read_hw_counter(hw->packet_counter_addr);
            uint64_t hw_bytes = read_hw_counter(hw->byte_counter_addr);

            // Update software state
            update_flow_counters(hw->flow_key, hw_packets, hw_bytes);

            // Reset hardware counters
            write_hw_counter(hw->packet_counter_addr, 0);
            write_hw_counter(hw->byte_counter_addr, 0);
        }
    }
}
Hardware acceleration moved per-packet processing from software to ASICs, dramatically improving throughput.
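The TCAM behind this is a finite resource, so offload has to be selective: only flows that have proven hot earn a hardware entry, and everything else stays on the software cache path. A hedged sketch of that decision; the packet-count threshold and the hw_table capacity field are assumptions for illustration.

#include <stdbool.h>

#define HW_OFFLOAD_MIN_PACKETS 1000     // assumed threshold: only offload flows proven hot

bool try_offload_flow(linecard_context_t *lc, flow_entry_t *flow) {
    if (flow->packet_count < HW_OFFLOAD_MIN_PACKETS) {
        return false;                                   // not hot enough: software path
    }
    if (lc->hw_table->num_entries >= lc->hw_table->capacity) {
        return false;                                   // TCAM full: stay on software path
    }
    program_hardware_flow(lc, flow);                    // install as shown above
    return true;
}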
NUMA-Aware Memory Allocation
The N7000 supervisor has a NUMA architecture:
void* allocate_numa_aware(size_t size, int target_cpu) {
    // Allocate memory on the NUMA node closest to the target CPU
    int target_node = numa_node_of_cpu(target_cpu);
    void *mem = (target_node >= 0) ? numa_alloc_onnode(size, target_node) : NULL;
    if (mem == NULL) {
        // Fallback to any node
        mem = numa_alloc_local(size);
    }
    return mem;
}

// Initialize per-CPU data structures
void init_per_cpu_structures() {
    int num_cpus = get_nprocs();
    for (int cpu = 0; cpu < num_cpus; cpu++) {
        // Allocate on the correct NUMA node
        per_cpu_data[cpu] = allocate_numa_aware(sizeof(cpu_data_t), cpu);

        // Pin the processing thread to this CPU
        pin_thread_to_cpu(worker_threads[cpu], cpu);
    }
}
NUMA-aware allocation reduced cross-node memory access by 80%, improving latency.
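The pin_thread_to_cpu helper used above can be built directly on pthread_setaffinity_np on Linux; a minimal sketch, with error handling reduced to the return code:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_thread_to_cpu(pthread_t thread, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);        // restrict the thread to exactly this CPU
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}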
Optimization 4: Multi-Tenancy Support
The N7000 typically runs multiple VDCs (Virtual Device Contexts), requiring isolation:
typedef struct vdc_context {
    vdc_id_t id;
    char name[64];

    // Per-VDC flow table
    flow_table_t *flow_table;

    // Resource limits
    uint32_t max_flows;
    uint32_t max_bandwidth_gbps;

    // Statistics
    atomic_uint64_t packets_processed;
    atomic_uint64_t bytes_processed;
    atomic_uint64_t bandwidth_limit_drops;

    // Isolation
    network_namespace_t *netns;
} vdc_context_t;

void process_packet_with_vdc_isolation(packet_t *pkt) {
    // Determine VDC from ingress port
    vdc_id_t vdc = get_vdc_for_port(pkt->ingress_port);

    // Switch to VDC context
    vdc_context_t *ctx = get_vdc_context(vdc);

    // Enter VDC network namespace (for isolation)
    enter_netns(ctx->netns);

    // Process within VDC context
    flow_entry_t *flow = lookup_flow(ctx->flow_table, pkt->flow_key);
    if (flow != NULL) {
        // Apply VDC-specific policy
        apply_flow_policy(pkt, flow);

        // Update VDC statistics
        atomic_fetch_add(&ctx->packets_processed, 1);
        atomic_fetch_add(&ctx->bytes_processed, pkt->size);

        // Check VDC resource limits
        if (check_vdc_bandwidth_limit(ctx, pkt->size)) {
            forward_packet(pkt);
        } else {
            drop_packet(pkt);
            atomic_fetch_add(&ctx->bandwidth_limit_drops, 1);
        }
    }

    // Exit VDC network namespace
    exit_netns();
}
VDC isolation ensures tenants can't interfere with each other.
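The check_vdc_bandwidth_limit call above can be implemented as a per-VDC token bucket. A simplified sketch under assumed field names; in practice enforcement may also lean on hardware policers.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t tokens_bytes;          // bytes currently available to send
    uint64_t bucket_max_bytes;      // burst allowance
    uint64_t refill_bytes_per_ms;   // derived from the VDC's max_bandwidth_gbps
    uint64_t last_refill_ms;
} vdc_token_bucket_t;

bool vdc_bucket_allow(vdc_token_bucket_t *tb, uint32_t pkt_bytes, uint64_t now_ms) {
    // Refill proportionally to elapsed time, capped at the burst size.
    uint64_t elapsed_ms = now_ms - tb->last_refill_ms;
    tb->tokens_bytes += elapsed_ms * tb->refill_bytes_per_ms;
    if (tb->tokens_bytes > tb->bucket_max_bytes) {
        tb->tokens_bytes = tb->bucket_max_bytes;
    }
    tb->last_refill_ms = now_ms;

    if (tb->tokens_bytes < pkt_bytes) {
        return false;               // over the limit: caller drops the packet
    }
    tb->tokens_bytes -= pkt_bytes;
    return true;
}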
Results
After three months of optimization, N7000 performance improved dramatically:
Before Optimization:
- Throughput: 1.8M packets/sec
- Latency: 4.5μs P99
- CPU utilization: 78%
- Cache hit rate: N/A (no caching)
After Optimization:
- Throughput: 2.7M packets/sec (+50%)
- Latency: 2.3μs P99 (-49%)
- CPU utilization: 52% (-33%)
- Cache hit rate: 94%
We achieved near-parity with dedicated storage switches while supporting the N7000's multi-tenant, converged architecture.
Performance Breakdown
Where did the improvements come from?
- Distributed caching: +35% throughput (eliminated backplane crossing)
- Hardware acceleration: +20% throughput (ASIC processing)
- Batching and compression: -40% supervisor CPU (reduced message rate)
- NUMA optimization: -15% latency (better memory locality)
Lessons Learned
This optimization effort reinforced several principles:
1. Architecture Matters More Than Code
The single biggest improvement came from architectural change (distributed caching), not code-level optimization. Understanding platform architecture guides effective optimization.
2. Leverage Platform Strengths
The N7000's distributed line cards initially seemed like a challenge. By embracing the architecture and caching locally, we turned it into an advantage.
3. Minimize Inter-Component Communication
In distributed systems, communication is expensive. Caching, batching, and compression all reduce communication overhead.
4. Hardware Acceleration Is Worth Complexity
Programming hardware flow tables added complexity, but moving per-packet processing from software to ASICs was worth it.
5. NUMA Awareness Matters at Scale
On modern multi-socket systems, memory locality significantly impacts performance. NUMA-aware allocation is essential.
Looking Forward
The N7000 optimizations have broader implications:
Convergence Benefits
Running FC-Redirect on a converged platform (Ethernet + FC) enables new use cases:
- Unified fabric management
- Coordinated QoS across FC and Ethernet
- Simplified datacenter architecture
Scalability Path
The distributed caching model scales naturally. As we add line cards, cache capacity grows proportionally. This provides a path to massive scale.
Multi-Tenancy Foundation
The VDC isolation work enables cloud-style multi-tenancy for storage networking. Multiple customers can share infrastructure with strong isolation guarantees.
Conclusion
Optimizing FC-Redirect for the N7000 required rethinking our architectural assumptions. The distributed, multi-tenant, converged nature of the platform demanded different approaches than dedicated storage switches.
By embracing the N7000's architecture rather than fighting it, we achieved excellent performance while unlocking new capabilities like multi-tenancy and fabric convergence.
Platform-specific optimization is often dismissed as "not portable," but in performance-critical systems, leveraging platform strengths is essential. The challenge is balancing portability with performance.
The N7000 work demonstrates that storage networking can thrive on converged platforms. As datacenter infrastructure continues converging, this optimization work positions us well for the future.
Sometimes the best optimization isn't making your code faster; it's redesigning your architecture to fit the platform. That's the lesson from the N7000 project.