Cisco is preparing to launch the MDS 9700, the next generation of our storage networking platform. I've had early access to the architecture specifications and have been exploring what it means for FC-Redirect. The capabilities are impressive, but they require rethinking some of our fundamental assumptions.
MDS 9700 Architecture Overview
The 9700 represents a significant leap forward:
Hardware Capabilities:
- 768 ports per chassis (4x previous generation)
- 32 Gbps Fibre Channel support
- 96-core processing capability
- DDR4 memory with 2x bandwidth
- PCIe Gen3 internal fabric
- Advanced ASIC with programmable pipeline
Key Differences from Previous Platforms:
MDS 9250i (Current):          MDS 9700 (Next-Gen):
┌───────────────────┐         ┌────────────────────────┐
│ 8-core x86        │         │ 96-core x86            │
│ 48 ports          │         │ 768 ports              │
│ 16 Gbps FC        │         │ 32 Gbps FC             │
│ DDR3 Memory       │         │ DDR4 Memory            │
│ PCIe Gen2         │         │ PCIe Gen3              │
└───────────────────┘         └────────────────────────┘
Scaling factor: 16x ports, 12x cores, 2x speed
The scale increase is massive. FC-Redirect must evolve to leverage this.
Challenge 1: Scaling to 768 Ports
Our current flow tracking architecture assumes ~100 ports max. 768 ports changes everything.
Port-Based Sharding
I'm implementing port-based sharding to distribute load:
#define MDS9700_MAX_PORTS 768
#define FLOWS_PER_PORT_SHARD 1000
typedef struct port_shard {
port_id_t port_id;
flow_table_t *flow_table;
pthread_t processing_thread;
cpu_set_t cpu_affinity;
// Per-port statistics
atomic_uint64_t packets_processed;
atomic_uint64_t flows_active;
} port_shard_t;
typedef struct mds9700_context {
port_shard_t shards[MDS9700_MAX_PORTS];
int num_shards;
// Global flow index for cross-port lookups
global_flow_index_t *global_index;
} mds9700_context_t;
void init_port_sharding(mds9700_context_t *ctx, int num_ports) {
    ctx->num_shards = num_ports;
    for (int i = 0; i < num_ports; i++) {
        port_shard_t *shard = &ctx->shards[i];
        shard->port_id = i;
        shard->flow_table = create_flow_table(FLOWS_PER_PORT_SHARD);
        // Pin each shard to a specific core, spreading load across all 96 cores
        int cpu = i % 96;
        CPU_ZERO(&shard->cpu_affinity);
        CPU_SET(cpu, &shard->cpu_affinity);
        // Apply the affinity at creation time so the thread never starts
        // on the wrong core
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t),
                                    &shard->cpu_affinity);
        pthread_create(&shard->processing_thread, &attr,
                       port_processing_thread, shard);
        pthread_attr_destroy(&attr);
    }
}
This architecture scales linearly with port count.
NUMA-Aware Memory
With 96 cores, NUMA becomes critical:
void allocate_port_shard_numa_aware(port_shard_t *shard, int cpu) {
    // Ask libnuma which node this CPU belongs to instead of hard-coding
    // the topology (the 9700 has 4 NUMA nodes with 24 cores each)
    int numa_node = numa_node_of_cpu(cpu);
    // Prefer that node for the shard's allocations so the processing
    // thread never has to reach across the interconnect
    numa_set_preferred(numa_node);
    shard->flow_table = create_flow_table(FLOWS_PER_PORT_SHARD);
    shard->packet_buffers = allocate_packet_buffers(BUFFER_SIZE);
}
NUMA-aware allocation prevents expensive cross-node memory access.
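To wire this into the sharding setup, the shard allocation would run through this allocator before each processing thread is started, using the same core mapping. A minimal sketch (the helper name is mine):
void init_shard_memory(mds9700_context_t *ctx) {
    for (int i = 0; i < ctx->num_shards; i++) {
        // Same i % 96 mapping used when pinning the processing threads,
        // so each shard's tables live on the node its thread runs on
        allocate_port_shard_numa_aware(&ctx->shards[i], i % 96);
    }
}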
Challenge 2: 32 Gbps Fibre Channel
32 Gbps FC means 2x packet rate compared to 16 Gbps:
Performance Requirements:
- 16 Gbps: ~4M packets/sec
- 32 Gbps: ~8M packets/sec
Our current implementation maxes out at 2.9M packets/sec. We need major optimizations.
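To make the gap concrete, here is a quick back-of-the-envelope check of the per-packet budget. The 2.2 GHz core clock is my own illustrative assumption, not a 9700 spec:
#include <stdio.h>

int main(void) {
    const double target_pps = 8e6;    // packet rate at 32 Gbps (from above)
    const double core_hz    = 2.2e9;  // assumed core clock, illustrative only
    double budget_ns = 1e9 / target_pps;      // wall-clock budget per packet
    double cycles    = core_hz / target_pps;  // cycles available on one core
    printf("Per-packet budget: %.0f ns (~%.0f cycles on one core)\n",
           budget_ns, cycles);
    return 0;
}
That works out to roughly 125 ns per packet on a single core, which is why sharding across cores and pushing the fast path into hardware (next) both matter.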
Hardware Offload
The 9700's ASIC supports programmable packet processing:
// Program ASIC for fast-path flow processing
typedef struct asic_flow_entry {
fc_id_t src_fcid;
fc_id_t dst_fcid;
// Action
enum {
ASIC_ACTION_FORWARD,
ASIC_ACTION_REDIRECT,
ASIC_ACTION_DROP,
ASIC_ACTION_MIRROR
} action;
port_id_t output_port;
uint8_t qos_class;
// Counters (updated in hardware)
uint64_t packet_count;
uint64_t byte_count;
} asic_flow_entry_t;
bool program_asic_flow(asic_interface_t *asic,
flow_entry_t *flow) {
asic_flow_entry_t hw_entry = {
.src_fcid = flow->src_fcid,
.dst_fcid = flow->dst_fcid,
.action = translate_action(flow->policy.action),
.output_port = flow->policy.output_port,
.qos_class = flow->policy.qos_class
};
// Install in ASIC flow table (TCAM)
return asic_install_flow_entry(asic, &hw_entry);
}
Hardware processing handles the fast path at line rate. Software handles exceptions.
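The split works roughly like this: a packet that misses the ASIC flow table is punted to software, which resolves the policy, handles that packet, and installs a hardware entry so the rest of the flow stays on the fast path. A sketch of that exception path; the punt hook and the lookup_or_create_flow/log_warning helpers are my own names, not the driver API:
void handle_asic_miss(asic_interface_t *asic, packet_t *pkt) {
    // Resolve (or create) the software flow entry for this packet
    flow_entry_t *flow = lookup_or_create_flow(pkt->flow_key);
    // Handle this one packet in software
    apply_flow_policy(pkt, flow);
    forward_packet(pkt);
    // Offload the flow so subsequent packets never leave the ASIC
    if (!program_asic_flow(asic, flow)) {
        // TCAM full: the flow simply stays on the software path
        log_warning("ASIC flow table full, keeping flow in software");
    }
}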
Batch Processing
Process packets in batches for better cache utilization:
#define BATCH_SIZE 64
typedef struct packet_batch {
packet_t *packets[BATCH_SIZE];
int count;
} packet_batch_t;
void process_packet_batch(packet_batch_t *batch) {
// Prefetch flow entries
for (int i = 0; i < batch->count; i++) {
wwpn_t key = batch->packets[i]->flow_key;
flow_entry_t *flow = lookup_flow_prefetch(key);
__builtin_prefetch(flow, 0, 3); // Prefetch for read
}
// Process all packets in batch
for (int i = 0; i < batch->count; i++) {
packet_t *pkt = batch->packets[i];
flow_entry_t *flow = lookup_flow(pkt->flow_key);
// Flow already in cache from prefetch
apply_flow_policy(pkt, flow);
forward_packet(pkt);
}
}
Batching + prefetching reduces cache misses significantly.
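For completeness, the receive loop that fills those batches looks something like this; rx_queue_poll() is a placeholder for whatever the 9700 driver actually exposes:
void rx_loop(rx_queue_t *rxq) {
    packet_batch_t batch;
    for (;;) {
        // Pull up to BATCH_SIZE packets off the receive queue in one call
        batch.count = rx_queue_poll(rxq, batch.packets, BATCH_SIZE);
        if (batch.count == 0)
            continue;  // nothing pending; a real loop would back off here
        process_packet_batch(&batch);
    }
}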
Challenge 3: Massive Flow Scale
768 ports × 1000 flows/port = 768K potential concurrent flows:
Hierarchical Flow Tables
Flat hash tables don't scale to 768K entries efficiently. I'm implementing hierarchical tables:
typedef struct hierarchical_flow_table {
// Level 1: Per-port tables (hot data)
flow_table_t *port_tables[MDS9700_MAX_PORTS];
// Level 2: Global overflow table (cold data)
flow_table_t *overflow_table;
// Level 3: Disk-backed archive (historical)
flow_archive_t *archive;
// Statistics
atomic_uint64_t l1_hits;
atomic_uint64_t l2_hits;
atomic_uint64_t l3_hits;
} hierarchical_flow_table_t;
flow_entry_t* hierarchical_lookup(hierarchical_flow_table_t *table,
port_id_t port,
wwpn_t flow_key) {
// L1: Check port-local table (hot path)
flow_entry_t *flow = flow_table_lookup(table->port_tables[port],
flow_key);
if (flow != NULL) {
atomic_fetch_add(&table->l1_hits, 1);
return flow;
}
// L2: Check global overflow
flow = flow_table_lookup(table->overflow_table, flow_key);
if (flow != NULL) {
atomic_fetch_add(&table->l2_hits, 1);
// Promote to L1 if frequently accessed
if (should_promote_to_l1(flow)) {
promote_flow_to_l1(table, port, flow);
}
return flow;
}
// L3: Check archive (rare)
flow = flow_archive_lookup(table->archive, flow_key);
if (flow != NULL) {
atomic_fetch_add(&table->l3_hits, 1);
// Restore to L2
flow_table_insert(table->overflow_table, flow);
return flow;
}
return NULL;
}
This hierarchy keeps hot data in fast tables and cold data in slower storage.
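The lookup path above handles promotion; demotion is the mirror image. A background sweep along these lines would push idle flows from a port's L1 table down into the overflow table (the iterator helpers, last_active field, and idle threshold are my own placeholders):
void demote_idle_flows(hierarchical_flow_table_t *table, port_id_t port,
                       timestamp_t now) {
    flow_iter_t it;
    flow_iter_init(&it, table->port_tables[port]);
    flow_entry_t *flow;
    while ((flow = flow_iter_next(&it)) != NULL) {
        if (now - flow->last_active > L1_IDLE_THRESHOLD_SEC) {
            // Cold flow: move it from the per-port L1 table to the global L2
            flow_table_remove(table->port_tables[port], flow->flow_key);
            flow_table_insert(table->overflow_table, flow);
        }
    }
}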
Challenge 4: Advanced Telemetry
The 9700 supports in-band telemetry (INT, In-band Network Telemetry):
Telemetry Integration
typedef struct flow_telemetry {
flow_id_t flow_id;
// Per-hop metrics
struct hop_telemetry {
device_id_t device;
port_id_t ingress_port;
port_id_t egress_port;
timestamp_t timestamp;
uint32_t queue_depth;
uint32_t latency_ns;
} hops[MAX_HOPS];
int num_hops;
} flow_telemetry_t;
void process_telemetry_packet(packet_t *pkt) {
// Extract INT data from packet
flow_telemetry_t telemetry = {0};
extract_int_data(pkt, &telemetry);
// Update flow with telemetry data
flow_entry_t *flow = lookup_flow(telemetry.flow_id);
if (flow != NULL) {
update_flow_telemetry(flow, &telemetry);
// Detect anomalies
if (detect_congestion(&telemetry)) {
trigger_congestion_avoidance(flow);
}
if (detect_asymmetric_routing(&telemetry)) {
log_routing_asymmetry(flow);
}
}
}
INT provides unprecedented visibility into flow behavior.
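As one example of what this enables, detect_congestion() can be as simple as scanning the per-hop records for an outsized queue depth or latency. This is my own sketch with illustrative thresholds, not tuned values:
#define CONGESTION_QUEUE_DEPTH  1024    // entries; illustrative
#define CONGESTION_LATENCY_NS   500000  // 500 microseconds per hop; illustrative

bool detect_congestion(const flow_telemetry_t *t) {
    for (int i = 0; i < t->num_hops; i++) {
        if (t->hops[i].queue_depth > CONGESTION_QUEUE_DEPTH ||
            t->hops[i].latency_ns > CONGESTION_LATENCY_NS) {
            return true;  // any single congested hop flags the flow
        }
    }
    return false;
}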
Challenge 5: Energy Efficiency
768 ports draw significant power. The 9700 supports dynamic power management:
typedef struct port_power_state {
port_id_t port;
enum {
POWER_FULL, // 100% - active traffic
POWER_REDUCED, // 60% - idle but ready
POWER_SUSPEND, // 10% - no traffic expected
POWER_OFF // 0% - administratively down
} state;
timestamp_t last_packet_time;
uint32_t idle_time_sec;
} port_power_state_t;
void manage_port_power(port_power_state_t *port) {
uint32_t idle_time = time_since_last_packet(port);
if (port->state == POWER_FULL && idle_time > 60) {
// No traffic for 60 seconds, reduce power
set_port_power_state(port, POWER_REDUCED);
} else if (port->state == POWER_REDUCED && idle_time > 300) {
// No traffic for 5 minutes, suspend
set_port_power_state(port, POWER_SUSPEND);
} else if (port->state == POWER_SUSPEND && idle_time > 3600) {
// No traffic for 1 hour, consider shutdown
if (can_safely_shutdown_port(port)) {
set_port_power_state(port, POWER_OFF);
}
}
// Wake on packet arrival
if (packet_arrived(port) && port->state != POWER_FULL) {
set_port_power_state(port, POWER_FULL);
}
}
Intelligent power management reduces operating costs significantly.
Testing at Scale
Testing FC-Redirect at 768-port scale requires new approaches:
Simulation Framework
typedef struct port_simulator {
port_id_t port_id;
traffic_generator_t *traffic_gen;
// Simulated load
uint32_t packets_per_second;
uint32_t flows_active;
traffic_pattern_t pattern;
} port_simulator_t;
void simulate_mds9700_load(int num_ports) {
port_simulator_t sims[num_ports];
// Configure realistic load per port
for (int i = 0; i < num_ports; i++) {
sims[i].port_id = i;
sims[i].packets_per_second = 10000; // 10K pps per port
sims[i].flows_active = 100;
sims[i].pattern = PATTERN_BURSTY;
start_port_simulator(&sims[i]);
}
// Total: 768 ports × 10K pps = 7.68M pps
// Monitor system behavior under load
monitor_system_health(3600); // 1 hour test
// Collect results
for (int i = 0; i < num_ports; i++) {
port_stats_t stats = get_port_stats(&sims[i]);
analyze_port_performance(&stats);
}
}
Preliminary Results
Early testing on 9700 prototype hardware shows:
Performance:
- Throughput: 7.2M packets/sec (2.5x improvement)
- Latency: 1.8μs P99 (slight improvement)
- Flow capacity: 500K concurrent flows (42x improvement)
Efficiency:
- CPU utilization: 65% at max load (good headroom)
- Memory: 48GB for 500K flows (0.1MB per flow)
- Power: 2800W full load, 1400W typical (50% savings vs always-on)
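To put those power numbers in context: the 1400W saved at typical load works out to roughly 1.4 kW × 8760 h ≈ 12,300 kWh per chassis per year, on the order of $1,200 annually at an assumed $0.10/kWh (my estimate, not a measured figure).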
Lessons for Next-Gen Platform Support
This work has taught me:
1. Scale Changes Everything
Techniques that work at small scale fail at large scale. Hierarchical approaches become essential.
2. Hardware Offload Is Critical
At 8M pps, software can't keep up. Hardware acceleration is necessary for line-rate processing.
3. NUMA Awareness Is Mandatory
With many cores and NUMA architecture, memory locality determines performance.
4. Power Management Matters
At datacenter scale, power costs rival hardware costs. Efficiency is a first-class requirement.
5. Testing at Scale Is Hard
Simulating 768-port load requires specialized tooling and infrastructure.
Looking Forward
The MDS 9700 launches later this year. FC-Redirect will be ready to leverage its capabilities:
- 10x scale increase
- 2.5x performance improvement
- Advanced telemetry integration
- Significant power efficiency
This platform will serve customers for the next decade. The architectural work we're doing now ensures FC-Redirect can grow with customer demands.
Next-generation platforms create opportunities to rethink assumptions and build better systems. The 9700 is such an opportunity.
As storage networks continue growing in scale and sophistication, platforms like the 9700 provide the foundation for meeting customer needs. Iβm excited to see what customers build on this platform.
The future of storage networking is more scale, more speed, and more intelligence. The MDS 9700 delivers all three.