The Nexus 7000 represents a different approach to data center networking compared to dedicated storage switches. As a modular platform supporting Ethernet, Fibre Channel, and FCoE, it requires careful optimization to deliver FC-Redirect performance competitive with purpose-built storage switches.
Understanding the N7000 Architecture
The N7000's architecture is fundamentally different from that of the MDS family:
Hardware Architecture
N7000 Components:
┌─────────────────────────────────────┐
│          Supervisor Module          │
│     (Control Plane - x86 Linux)     │
└─────────────────┬───────────────────┘
                  │
        ┌─────────┼──────────┐
        │         │          │
   ┌────▼───┐ ┌───▼────┐ ┌───▼────┐   ...
   │ Line   │ │ Line   │ │ Line   │
   │ Card 1 │ │ Card 2 │ │ Card 3 │
   │(F3/M3) │ │(F3/M3) │ │(F3/M3) │
   └────────┘ └────────┘ └────────┘
Key characteristics:
- Distributed forwarding: Each line card has local ASICs and CPU
- Centralized control: Supervisor runs control plane software
- Mixed traffic: FC, FCoE, and Ethernet on same chassis
- Modular design: Line cards can be added/removed
This distributed architecture creates both challenges and opportunities.
The Performance Challenge
The initial FC-Redirect port to the N7000 showed disappointing results:
MDS 9250i (baseline):
- Throughput: 2.9M packets/sec
- Latency: 2.0μs P99
- CPU utilization: 45%
N7000 (initial port):
- Throughput: 1.8M packets/sec (38% slower)
- Latency: 4.5μs P99 (2.25x worse)
- CPU utilization: 78%
We had significant optimization work ahead.
Optimization 1: Distributed Flow Processing
The N7000's distributed architecture led me to rethink our centralized flow processing model.
Original Architecture (Centralized)
// All flow processing on supervisor
void process_packet_centralized(packet_t *pkt) {
    // Packet arrives at line card
    // Line card sends to supervisor
    // Supervisor looks up flow
    // Supervisor sends decision back to line card
    // Line card forwards packet
    // Result: 2 inter-module communications per packet
}
This incurred massive overhead: packets crossed the backplane twice per operation.
Optimized Architecture (Distributed)
// Flow processing distributed to line cards
typedef struct linecard_flow_cache {
    flow_entry_t *hot_flows;    // Recently used flows
    uint32_t cache_size;
    cache_stats_t stats;
} linecard_flow_cache_t;

void process_packet_distributed(packet_t *pkt, linecard_flow_cache_t *cache) {
    // Check line card local cache
    flow_entry_t *flow = lookup_local_cache(cache, pkt->flow_key);
    if (flow != NULL) {
        // Cache hit - process locally, no backplane crossing
        apply_flow_policy(pkt, flow);
        forward_packet_locally(pkt);
        cache->stats.hits++;
        return;
    }

    // Cache miss - query supervisor
    flow = query_supervisor_flow(pkt->flow_key);
    if (flow != NULL) {
        // Add to local cache
        insert_local_cache(cache, flow);
        apply_flow_policy(pkt, flow);
        forward_packet_locally(pkt);
        cache->stats.misses++;
    } else {
        // Unknown flow - send to supervisor
        send_to_supervisor(pkt);
        cache->stats.unknowns++;
    }
}
This caching approach leverages temporal locality. If we see a flow once, we'll likely see it again soon. By caching flow policies locally on line cards, we eliminate backplane traffic for cached flows.
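The processing path above leans on lookup_local_cache and insert_local_cache helpers that aren't shown. A minimal sketch of what they might look like over a direct-mapped slot array follows; the hash function, entry layout, and simplified signatures are assumptions for illustration, not the production code.

#include <stdint.h>
#include <stdbool.h>

#define LC_CACHE_SLOTS 2048              // mirrors LINECARD_CACHE_SIZE below

typedef struct {
    uint64_t flow_key;                   // packed flow identifier (illustrative)
    uint32_t policy_id;                  // handle into the local policy table
    uint64_t last_access;                // recency counter for LRU decisions
    bool     valid;
} lc_slot_t;

typedef struct {
    lc_slot_t slots[LC_CACHE_SLOTS];
    uint64_t  tick;                      // monotonically increasing access counter
} lc_cache_t;

static uint32_t slot_for_key(uint64_t key) {
    // Multiplicative hash into the slot array; the real hash is a platform detail.
    return (uint32_t)((key * 0x9E3779B97F4A7C15ULL) >> 53);
}

lc_slot_t *lookup_local_cache(lc_cache_t *cache, uint64_t key) {
    lc_slot_t *slot = &cache->slots[slot_for_key(key)];
    if (slot->valid && slot->flow_key == key) {
        slot->last_access = ++cache->tick;   // refresh recency on a hit
        return slot;
    }
    return NULL;                             // miss: caller queries the supervisor
}

void insert_local_cache(lc_cache_t *cache, uint64_t key, uint32_t policy_id) {
    lc_slot_t *slot = &cache->slots[slot_for_key(key)];
    slot->flow_key    = key;                 // direct-mapped: a colliding entry is overwritten
    slot->policy_id   = policy_id;
    slot->last_access = ++cache->tick;
    slot->valid       = true;
}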
Cache Management
The cache needs intelligent management:
#define LINECARD_CACHE_SIZE 2048    // Flows per line card

typedef struct flow_cache_entry {
    wwpn_t flow_key;
    flow_policy_t policy;
    timestamp_t last_access;
    uint64_t packet_count;
    bool valid;
} flow_cache_entry_t;

// LRU eviction when cache is full
void evict_lru_entry(linecard_flow_cache_t *cache) {
    timestamp_t oldest = UINT64_MAX;
    uint32_t victim_index = 0;

    for (int i = 0; i < cache->cache_size; i++) {
        if (cache->entries[i].last_access < oldest) {
            oldest = cache->entries[i].last_access;
            victim_index = i;
        }
    }

    // Invalidate victim
    cache->entries[victim_index].valid = false;
    cache->stats.evictions++;
}

// Proactive cache warming
void warm_cache_from_supervisor(linecard_flow_cache_t *cache) {
    // Get top N most active flows from supervisor
    flow_list_t *hot_flows = get_hot_flows_from_supervisor(cache->cache_size);

    for (int i = 0; i < hot_flows->count; i++) {
        insert_local_cache(cache, &hot_flows->flows[i]);
    }
    cache->stats.warmings++;
}
Cache warming at startup dramatically improves hit rates during the critical first minutes after line card insertion or reboot.
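To tell whether warming (and caching in general) is paying off, the hit rate reported in the results later falls straight out of the counters the cache already maintains. A small sketch, assuming cache_stats_t carries the hits/misses/unknowns counters incremented in the processing path:

#include <stdint.h>

typedef struct {
    uint64_t hits;
    uint64_t misses;
    uint64_t unknowns;
    uint64_t evictions;
    uint64_t warmings;
} cache_stats_t;               // field set assumed from the counters used above

double cache_hit_rate(const cache_stats_t *s) {
    uint64_t lookups = s->hits + s->misses + s->unknowns;
    return lookups ? (double)s->hits / (double)lookups : 0.0;
}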
Optimization 2: Backplane Traffic Minimization
Communication between line cards and the supervisor is expensive. I implemented several techniques to minimize it:
Batched Updates
typedef struct update_batch {
    flow_update_t updates[BATCH_SIZE];
    uint32_t count;
    timestamp_t batch_start_time;
} update_batch_t;

void send_statistics_update(linecard_context_t *lc, flow_update_t *update) {
    // Add to batch
    lc->update_batch.updates[lc->update_batch.count++] = *update;

    timestamp_t now = get_time_ms();
    uint32_t batch_age = now - lc->update_batch.batch_start_time;

    // Flush if batch is full or old
    if (lc->update_batch.count >= BATCH_SIZE || batch_age > 100) {
        // Send entire batch in one message
        send_batch_to_supervisor(&lc->update_batch);

        // Reset batch
        lc->update_batch.count = 0;
        lc->update_batch.batch_start_time = now;
    }
}
Batching reduced the supervisor message rate from 200K/sec to 15K/sec, a 93% reduction.
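One caveat with flushing only from the data path: if traffic pauses, a partially filled batch can sit until the next packet arrives. A hedged sketch of a timer-driven flush that closes that gap; the callback name and timer wiring are assumptions rather than the shipping code.

// Called periodically (e.g., every 100 ms) from a line-card housekeeping timer.
void flush_stale_batch(linecard_context_t *lc) {
    if (lc->update_batch.count == 0) {
        return;                                    // nothing pending to send
    }
    send_batch_to_supervisor(&lc->update_batch);   // same send path as a full batch
    lc->update_batch.count = 0;
    lc->update_batch.batch_start_time = get_time_ms();
}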
Compression
// Compress flow updates before sending
void send_compressed_batch(update_batch_t *batch) {
    // Delta encoding: send only changes
    compressed_batch_t compressed = {0};
    for (int i = 0; i < batch->count; i++) {
        flow_update_t *update = &batch->updates[i];
        // Only include changed fields
        if (update->packets_changed) {
            add_packet_delta(&compressed, update->flow_key,
                             update->packet_delta);
        }
        if (update->bytes_changed) {
            add_byte_delta(&compressed, update->flow_key,
                           update->byte_delta);
        }
    }

    // Compress the delta-encoded batch with LZ4 (fast, decent ratio)
    char output_buffer[MAX_COMPRESSED_SIZE];
    int original_size = (int)sizeof(compressed);
    int compressed_size = LZ4_compress_default(
        (const char *)&compressed, output_buffer,
        original_size, (int)sizeof(output_buffer));
    if (compressed_size <= 0) {
        // Compression failed (output buffer too small): send uncompressed
        send_to_supervisor(&compressed, original_size);
        return;
    }

    // Send compressed data
    send_to_supervisor(output_buffer, compressed_size);

    // Track compression ratio
    float ratio = (float)original_size / (float)compressed_size;
    update_compression_stats(ratio);
}
Compression reduced bandwidth by 65% on typical workloads.
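On the receiving side, the supervisor reverses the process: decompress with LZ4_decompress_safe, then fold the deltas into its flow state. A minimal sketch under assumed names (handle_compressed_batch, apply_flow_deltas, and the error counter are illustrative, not the actual supervisor code):

#include <lz4.h>

void handle_compressed_batch(const char *wire_data, int wire_len) {
    compressed_batch_t decoded;
    // LZ4_decompress_safe returns the decompressed size, or a negative value on error.
    int n = LZ4_decompress_safe(wire_data, (char *)&decoded,
                                wire_len, (int)sizeof(decoded));
    if (n < 0) {
        // Corrupt or truncated message: count it and drop rather than crash
        // (bad_batches is a hypothetical counter for this sketch).
        supervisor_stats.bad_batches++;
        return;
    }
    apply_flow_deltas(&decoded);   // fold packet/byte deltas into supervisor flow state
}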
Optimization 3: Platform-Specific Features
The N7000 has unique hardware features we can leverage:
Hardware Flow Acceleration
The F3 line cards have hardware flow recognition:
// Program hardware to recognize and cache flows
void program_hardware_flow(linecard_context_t *lc, flow_entry_t *flow) {
    // Hardware flow entry
    hw_flow_entry_t hw_entry = {
        .src_fcid = flow->src_fcid,
        .dst_fcid = flow->dst_fcid,
        .action = flow->policy.action,
        .output_port = flow->policy.output_port,
        .qos_class = flow->policy.qos_class,
        // Hardware statistics
        .hw_packet_counter = 0,
        .hw_byte_counter = 0
    };

    // Install in hardware TCAM
    install_hw_flow_entry(lc->hw_table, &hw_entry);
}

// Periodically sync hardware counters
void sync_hardware_counters(linecard_context_t *lc) {
    for (int i = 0; i < lc->hw_table->num_entries; i++) {
        hw_flow_entry_t *hw = &lc->hw_table->entries[i];
        if (hw->valid) {
            // Read hardware counters
            uint64_t hw_packets = read_hw_counter(hw->packet_counter_addr);
            uint64_t hw_bytes = read_hw_counter(hw->byte_counter_addr);

            // Update software state
            update_flow_counters(hw->flow_key, hw_packets, hw_bytes);

            // Reset hardware counters
            write_hw_counter(hw->packet_counter_addr, 0);
            write_hw_counter(hw->byte_counter_addr, 0);
        }
    }
}
Hardware acceleration moved per-packet processing from software to ASICs, dramatically improving throughput.
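The TCAM behind this is a finite resource, so offload has to be selective: only flows that have proven hot earn a hardware entry, and everything else stays on the software cache path. A hedged sketch of that decision; the packet-count threshold and the hw_table capacity field are assumptions for illustration.

#include <stdbool.h>

#define HW_OFFLOAD_MIN_PACKETS 1000     // assumed threshold: only offload flows proven hot

bool try_offload_flow(linecard_context_t *lc, flow_entry_t *flow) {
    if (flow->packet_count < HW_OFFLOAD_MIN_PACKETS) {
        return false;                                   // not hot enough: software path
    }
    if (lc->hw_table->num_entries >= lc->hw_table->capacity) {
        return false;                                   // TCAM full: stay on software path
    }
    program_hardware_flow(lc, flow);                    // install as shown above
    return true;
}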
NUMA-Aware Memory Allocation
The N7000 supervisor has a NUMA architecture:
void* allocate_numa_aware(size_t size, int target_cpu) {
    // Allocate memory on the NUMA node closest to the target CPU
    int target_node = numa_node_of_cpu(target_cpu);
    void *mem = (target_node >= 0) ? numa_alloc_onnode(size, target_node) : NULL;
    if (mem == NULL) {
        // Fallback to any node
        mem = numa_alloc_local(size);
    }
    return mem;
}

// Initialize per-CPU data structures
void init_per_cpu_structures() {
    int num_cpus = get_nprocs();
    for (int cpu = 0; cpu < num_cpus; cpu++) {
        // Allocate on the correct NUMA node
        per_cpu_data[cpu] = allocate_numa_aware(sizeof(cpu_data_t), cpu);

        // Pin the processing thread to this CPU
        pin_thread_to_cpu(worker_threads[cpu], cpu);
    }
}
NUMA-aware allocation reduced cross-node memory access by 80%, improving latency.
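The pin_thread_to_cpu helper used above can be built directly on pthread_setaffinity_np on Linux; a minimal sketch, with error handling reduced to the return code:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_thread_to_cpu(pthread_t thread, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);        // restrict the thread to exactly this CPU
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}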
Optimization 4: Multi-Tenancy Support
The N7000 typically runs multiple VDCs (Virtual Device Contexts), requiring isolation:
typedef struct vdc_context {
    vdc_id_t id;
    char name[64];

    // Per-VDC flow table
    flow_table_t *flow_table;

    // Resource limits
    uint32_t max_flows;
    uint32_t max_bandwidth_gbps;

    // Statistics
    atomic_uint64_t packets_processed;
    atomic_uint64_t bytes_processed;
    atomic_uint64_t bandwidth_limit_drops;

    // Isolation
    network_namespace_t *netns;
} vdc_context_t;

void process_packet_with_vdc_isolation(packet_t *pkt) {
    // Determine VDC from ingress port
    vdc_id_t vdc = get_vdc_for_port(pkt->ingress_port);

    // Switch to VDC context
    vdc_context_t *ctx = get_vdc_context(vdc);

    // Enter VDC network namespace (for isolation)
    enter_netns(ctx->netns);

    // Process within VDC context
    flow_entry_t *flow = lookup_flow(ctx->flow_table, pkt->flow_key);
    if (flow != NULL) {
        // Apply VDC-specific policy
        apply_flow_policy(pkt, flow);

        // Update VDC statistics
        atomic_fetch_add(&ctx->packets_processed, 1);
        atomic_fetch_add(&ctx->bytes_processed, pkt->size);

        // Check VDC resource limits
        if (check_vdc_bandwidth_limit(ctx, pkt->size)) {
            forward_packet(pkt);
        } else {
            drop_packet(pkt);
            atomic_fetch_add(&ctx->bandwidth_limit_drops, 1);
        }
    }

    // Exit VDC network namespace
    exit_netns();
}
VDC isolation ensures tenants can't interfere with each other.
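The check_vdc_bandwidth_limit call above can be implemented as a per-VDC token bucket. A simplified sketch under assumed field names; in practice enforcement may also lean on hardware policers.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t tokens_bytes;          // bytes currently available to send
    uint64_t bucket_max_bytes;      // burst allowance
    uint64_t refill_bytes_per_ms;   // derived from the VDC's max_bandwidth_gbps
    uint64_t last_refill_ms;
} vdc_token_bucket_t;

bool vdc_bucket_allow(vdc_token_bucket_t *tb, uint32_t pkt_bytes, uint64_t now_ms) {
    // Refill proportionally to elapsed time, capped at the burst size.
    uint64_t elapsed_ms = now_ms - tb->last_refill_ms;
    tb->tokens_bytes += elapsed_ms * tb->refill_bytes_per_ms;
    if (tb->tokens_bytes > tb->bucket_max_bytes) {
        tb->tokens_bytes = tb->bucket_max_bytes;
    }
    tb->last_refill_ms = now_ms;

    if (tb->tokens_bytes < pkt_bytes) {
        return false;               // over the limit: caller drops the packet
    }
    tb->tokens_bytes -= pkt_bytes;
    return true;
}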
Results
After three months of optimization, N7000 performance improved dramatically:
Before Optimization:
- Throughput: 1.8M packets/sec
- Latency: 4.5μs P99
- CPU utilization: 78%
- Cache hit rate: N/A (no caching)
After Optimization:
- Throughput: 2.7M packets/sec (+50%)
- Latency: 2.3μs P99 (-49%)
- CPU utilization: 52% (-33%)
- Cache hit rate: 94%
We achieved near-parity with dedicated storage switches while supporting the N7000's multi-tenant, converged architecture.
Performance Breakdown
Where did the improvements come from?
- Distributed caching: +35% throughput (eliminated backplane crossing)
- Hardware acceleration: +20% throughput (ASIC processing)
- Batching and compression: -40% supervisor CPU (reduced message rate)
- NUMA optimization: -15% latency (better memory locality)
Lessons Learned
This optimization effort reinforced several principles:
1. Architecture Matters More Than Code
The single biggest improvement came from architectural change (distributed caching), not code-level optimization. Understanding platform architecture guides effective optimization.
2. Leverage Platform Strengths
The N7000's distributed line cards initially seemed like a challenge. By embracing the architecture and caching locally, we turned it into an advantage.
3. Minimize Inter-Component Communication
In distributed systems, communication is expensive. Caching, batching, and compression all reduce communication overhead.
4. Hardware Acceleration Is Worth Complexity
Programming hardware flow tables added complexity, but moving per-packet processing from software to ASICs was worth it.
5. NUMA Awareness Matters at Scale
On modern multi-socket systems, memory locality significantly impacts performance. NUMA-aware allocation is essential.
Looking Forward
The N7000 optimizations have broader implications:
Convergence Benefits
Running FC-Redirect on a converged platform (Ethernet + FC) enables new use cases:
- Unified fabric management
- Coordinated QoS across FC and Ethernet
- Simplified datacenter architecture
Scalability Path
The distributed caching model scales naturally. As we add line cards, cache capacity grows proportionally. This provides a path to massive scale.
Multi-Tenancy Foundation
The VDC isolation work enables cloud-style multi-tenancy for storage networking. Multiple customers can share infrastructure with strong isolation guarantees.
Conclusion
Optimizing FC-Redirect for the N7000 required rethinking our architectural assumptions. The distributed, multi-tenant, converged nature of the platform demanded different approaches than dedicated storage switches.
By embracing the N7000's architecture rather than fighting it, we achieved excellent performance while unlocking new capabilities like multi-tenancy and fabric convergence.
Platform-specific optimization is often dismissed as "not portable," but in performance-critical systems, leveraging platform strengths is essential. The challenge is balancing portability with performance.
The N7000 work demonstrates that storage networking can thrive on converged platforms. As datacenter infrastructure continues converging, this optimization work positions us well for the future.
Sometimes the best optimization isn't making your code faster; it's redesigning your architecture to fit the platform. That's the lesson from the N7000 project.