Modern data centers demand network fabrics that can scale massively while providing consistent, predictable performance. Traditional designs are reaching their limits. At Cisco, we’re seeing customers adopt new fabric architectures. Let me explore what makes a good fabric and which design patterns are emerging.
What Makes a Good Fabric?
A data center fabric should provide:
Predictable Latency: The latency between any two endpoints should be consistent and low.
High Bandwidth: Sufficient bandwidth to handle peak loads without congestion.
Horizontal Scalability: Add capacity by adding nodes, not upgrading existing nodes.
No Single Point of Failure: Redundancy at every level.
Simplified Management: Easy to configure, monitor, and troubleshoot.
Support for Multiple Protocols: Handle Ethernet, FC, FCoE, and other protocols as needed.
These requirements drive fabric architecture decisions.
Traditional Three-Tier Limitations
The traditional access-aggregation-core model has limitations:
Oversubscription: Each tier is typically oversubscribed, and the ratios multiply from tier to tier. End to end, you might see 20:1 or higher oversubscription from access to core.
Spanning Tree: Blocks redundant links to prevent loops, wasting bandwidth. Within a spanning-tree instance, only one path forwards traffic even when physical redundancy exists.
East-West Traffic: Optimized for north-south traffic (client to server). Modern applications have much more east-west traffic (server to server).
Scale Limits: Adding capacity requires upgrading the core, which has practical limits.
These limitations drive interest in alternative architectures.
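The compounding effect described under Oversubscription above is simple multiplication. The sketch below works through it with hypothetical port counts (48 x 10G server ports with 4 x 40G uplinks at access, 32 x 40G downlinks with 4 x 40G uplinks at aggregation); these numbers are assumptions for illustration, not figures from any particular deployment.

```python
def tier_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Oversubscription at one tier: host-facing bandwidth over uplink bandwidth."""
    return downlink_gbps / uplink_gbps


def end_to_end_ratio(tier_ratios: list[float]) -> float:
    """Per-tier ratios multiply as traffic climbs from access toward the core."""
    total = 1.0
    for ratio in tier_ratios:
        total *= ratio
    return total


# Hypothetical port counts, chosen only to illustrate the arithmetic.
access = tier_ratio(downlink_gbps=48 * 10, uplink_gbps=4 * 40)       # 3.0
aggregation = tier_ratio(downlink_gbps=32 * 40, uplink_gbps=4 * 40)  # 8.0

print(f"access {access}:1, aggregation {aggregation}:1, "
      f"end to end {end_to_end_ratio([access, aggregation])}:1")
```

A modest 3:1 at access combined with 8:1 at aggregation already yields 24:1 end to end, which is how three-tier designs reach the ratios mentioned above.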
Leaf-Spine Architecture
The leaf-spine architecture (a folded Clos topology) addresses many of these limitations:
Structure:
- Spine layer: High-capacity switches forming the backbone
- Leaf layer: Access switches where servers connect
- Every leaf connects to every spine
Benefits:
- Predictable latency: Always the same number of hops (leaf-spine-leaf) between servers on different leaves
- High bandwidth: Multiple paths between any two servers
- Horizontal scaling: Add spine switches for more capacity, add leaf switches for more ports
- No spanning tree: All links are active with proper routing
Oversubscription: Still exists but is more controlled and predictable, because it is set in one place: the ratio of server-facing bandwidth to uplink bandwidth on each leaf. Typical designs run at 3:1 or 4:1.
This architecture is becoming dominant for large data centers.
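Because every leaf connects to every spine, the basic dimensions of a leaf-spine fabric follow directly from a handful of inputs. The sketch below is a minimal sizing model; the class name, fields, and the example values (4 spines, 16 leaves, 48 x 10G server ports, 40G uplinks) are assumptions made for illustration, and it assumes exactly one link from each leaf to each spine.

```python
from dataclasses import dataclass


@dataclass
class LeafSpineDesign:
    spines: int
    leaves: int
    server_ports_per_leaf: int
    server_port_gbps: float
    uplink_gbps: float  # speed of each leaf-to-spine link

    def oversubscription(self) -> float:
        """Server-facing bandwidth divided by uplink bandwidth on one leaf."""
        down = self.server_ports_per_leaf * self.server_port_gbps
        up = self.spines * self.uplink_gbps  # one uplink per spine (assumed)
        return down / up

    def total_server_ports(self) -> int:
        return self.leaves * self.server_ports_per_leaf

    def fabric_links(self) -> int:
        """Full mesh between the two layers: leaves x spines links."""
        return self.leaves * self.spines


design = LeafSpineDesign(spines=4, leaves=16, server_ports_per_leaf=48,
                         server_port_gbps=10, uplink_gbps=40)
print(design.oversubscription(), design.total_server_ports(), design.fabric_links())
```

In this model, adding a spine lowers the oversubscription ratio on every leaf, while adding a leaf adds server ports without changing the ratio, which matches the horizontal scaling benefit listed above.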
Fabric Routing
Traditional Ethernet uses spanning tree to prevent loops. Fabric architectures instead use routing and multipathing:
ECMP (Equal-Cost Multi-Path): Distributes traffic across multiple equal-cost paths. All links are used, not just one.
TRILL (Transparent Interconnection of Lots of Links): An IETF standard that provides Layer 2 multipathing using the IS-IS routing protocol.
FabricPath: Cisco’s implementation of similar concepts, providing Layer 2 multipathing.
Shortest Path Bridging (SPB): IEEE standard (802.1aq) for multipath bridging, also based on IS-IS.
These technologies eliminate spanning tree’s limitations, enabling full utilization of all links.
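To make the ECMP idea concrete: a switch typically hashes fields of each packet's flow identifier and uses the result to pick one of the equal-cost next hops, so a given flow stays on one path while different flows spread across all of them. The sketch below illustrates that idea only. The function name, the 5-tuple layout, and the use of SHA-256 are assumptions; real switches use hardware hash functions whose details vary by platform.

```python
import hashlib


def ecmp_next_hop(flow: tuple, next_hops: list[str]) -> str:
    """Pick a next hop by hashing the flow identifier.

    The same flow always maps to the same next hop, which keeps the
    packets of one flow in order on a single path.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()  # stand-in for a hardware hash
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


spines = ["spine1", "spine2", "spine3", "spine4"]
# (source IP, destination IP, protocol, source port, destination port)
flow = ("10.0.1.5", "10.0.2.9", 6, 49152, 3260)
print(ecmp_next_hop(flow, spines))
```

One consequence of per-flow hashing is that a few very large flows can still land on the same link, so even load across links is a statistical outcome rather than a guarantee.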
FC Fabric Design
For Fibre Channel, fabric design has different considerations:
Core-Edge: Traditional FC design with edge switches where devices connect and core switches providing connectivity.
Meshed Directors: For maximum capacity, fully mesh multiple directors.
Dual Fabrics: Always deploy two completely independent fabrics for redundancy.
ISL Design: Inter-switch links should be sized to prevent oversubscription during failures.
VSAN: Virtual SANs provide isolation within a physical fabric, similar to VLANs.
At Cisco, our MDS switches provide the foundation for scalable FC fabrics.
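The ISL sizing rule above can be reduced to a small calculation: provision enough links to carry the peak load, then add as many as you want to be able to lose. This is a rough sketch under stated assumptions; the function name, the usable per-link bandwidth, and the example numbers are hypothetical, and it ignores uneven load distribution across links.

```python
import math


def isls_required(peak_gbps: float, usable_isl_gbps: float,
                  tolerated_failures: int = 1) -> int:
    """ISLs needed so the surviving links still carry the peak load."""
    needed_for_load = math.ceil(peak_gbps / usable_isl_gbps)
    return needed_for_load + tolerated_failures


# Hypothetical: 40 Gbps of peak edge-to-core traffic over 16 Gbps usable ISLs.
print(isls_required(peak_gbps=40, usable_isl_gbps=16))
```

With dual fabrics, the same check is worth running per fabric against the full host load, since one fabric must carry everything while the other is down.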
Converged Fabrics with FCoE
FCoE enables converged fabrics carrying both Ethernet and FC:
Benefits:
- Fewer cables and adapters
- Unified management
- Better utilization through sharing
Requirements:
- Data Center Bridging (DCB) for lossless Ethernet
- FCoE-capable switches (Nexus series)
- Converged Network Adapters (CNAs)
Design Considerations:
- Plan bandwidth carefully, because LAN and SAN traffic share the same links
- Implement QoS to prioritize storage traffic
- Keep the logical fabrics separate even on a converged physical fabric
Converged fabrics work well for greenfield deployments and highly virtualized environments.
Buffer Management
Buffers are critical for fabric performance:
Credit-Based Flow Control: FC uses credits to prevent frame loss. Ensure adequate credits for link distance and speed.
Priority Flow Control (PFC): For FCoE, PFC provides lossless delivery for storage traffic while allowing other traffic to be lossy.
Buffer Sizing: Switches need adequate buffers to handle microbursts without drops.
Buffer Fairness: Ensure buffers are shared fairly across ports and traffic classes.
Poor buffer management causes performance problems that are hard to diagnose.
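The relationship between credits, distance, and speed mentioned under Credit-Based Flow Control can be estimated from first principles: the sender needs enough credits to keep transmitting for one full round trip before the first credit comes back. The sketch below assumes roughly 5 microseconds of propagation delay per kilometer of fiber, full-size 2148-byte FC frames, and a nominal data rate in MB/s (for example about 800 MB/s for 8G FC). All three are simplifying assumptions; smaller average frames require proportionally more credits, and vendor guidance should take precedence.

```python
import math


def bb_credits_needed(distance_km: float, data_rate_mb_per_s: float,
                      frame_bytes: int = 2148,
                      fiber_us_per_km: float = 5.0) -> int:
    """Estimate buffer-to-buffer credits needed to keep a long link busy."""
    round_trip_us = 2 * distance_km * fiber_us_per_km
    # bytes / (MB/s) gives microseconds when 1 MB = 10**6 bytes
    frame_time_us = frame_bytes / data_rate_mb_per_s
    return math.ceil(round_trip_us / frame_time_us)


# 10 km link at a nominal 800 MB/s with full-size frames
print(bb_credits_needed(distance_km=10, data_rate_mb_per_s=800))  # 38
```

Doubling either the distance or the link speed roughly doubles the credits required, which is why a link that performs well inside a data center can stall when extended across a campus or metro distance.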
Fabric Convergence
When topology changes (link failure, switch addition), the fabric must reconverge:
Convergence Time: How long it takes for all switches to agree on the new topology and update their forwarding tables.
Traffic Impact: What happens to in-flight traffic during reconvergence.
Stability: Prevent flapping and oscillation during convergence.
Modern fabrics aim for sub-second convergence with minimal traffic loss.
Multi-Tenant Fabrics
Data centers increasingly support multiple tenants:
Isolation: Segregate traffic and configuration between tenants.
Resource Allocation: Allocate bandwidth, ports, and VLANs per tenant.
Security: Prevent one tenant from accessing another’s traffic.
Chargeback: Meter usage for billing purposes.
Technologies enabling this include VRFs, VLANs, VSANs, and resource pools.
Automation and Orchestration
Large fabrics require automation:
Automated Provisioning: Configure ports, VLANs, and zoning via APIs or scripts.
Template-Based Configuration: Use templates for consistent configuration.
Configuration Management: Track configuration changes and maintain consistency.
Monitoring: Automated collection and analysis of metrics.
Manual configuration doesn’t scale to thousands of ports.
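As a small illustration of template-based configuration, the sketch below renders per-port configuration from structured data using Python's standard `string.Template`. The configuration syntax is generic, IOS-like text used only as an example, and the interface names, descriptions, and VLAN IDs are hypothetical. Adapt the template to whatever your platform and tooling actually expect.

```python
from string import Template

# Generic, illustrative port template; not tied to a specific platform.
PORT_TEMPLATE = Template(
    "interface $interface\n"
    "  description $description\n"
    "  switchport access vlan $vlan\n"
)


def render_ports(ports: list[dict]) -> str:
    """Render one config stanza per port; a missing field raises KeyError."""
    return "\n".join(PORT_TEMPLATE.substitute(port) for port in ports)


# Hypothetical inventory data, e.g. loaded from a spreadsheet or CMDB export.
ports = [
    {"interface": "Ethernet1/1", "description": "web-01", "vlan": 100},
    {"interface": "Ethernet1/2", "description": "web-02", "vlan": 100},
]
print(render_ports(ports))
```

Using `substitute` rather than `safe_substitute` is deliberate here: an incomplete inventory record fails loudly instead of producing a half-filled configuration.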
Monitoring and Visibility
You can’t manage what you can’t see:
Traffic Analysis: Understand traffic patterns, top talkers, application mix.
Performance Metrics: Latency, drops, errors, utilization across all links.
Topology Visibility: Clear view of fabric topology and status.
Alerting: Proactive alerts for problems before they impact applications.
Historical Analysis: Trend analysis for capacity planning.
Cisco’s fabric management tools provide visibility into MDS and Nexus fabrics.
High Availability Design
Fabric HA requires careful planning:
Dual Fabrics: Independent fabrics with no shared components.
Redundant Paths: Multiple paths between all endpoints.
Fast Failover: Sub-second detection and failover for link/switch failures.
Maintenance Mode: Support for non-disruptive maintenance.
Failure Domain Isolation: Prevent single failures from cascading.
Test failover scenarios regularly. Untested HA tends to fail exactly when you need it.
Capacity Planning
Plan fabric capacity carefully:
Bandwidth: Sufficient bandwidth for peak loads plus headroom for growth.
Port Count: Enough ports for current needs plus expansion.
Oversubscription: Understand and plan for acceptable oversubscription ratios.
Growth Projection: Plan for 3-5 years of growth.
Performance Headroom: Don’t plan for 100% utilization; leave 30-40% headroom.
Under-provisioning fabrics creates problems that are expensive to fix later.
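The Growth Projection and Performance Headroom items above combine into a simple calculation: compound today's peak over the planning horizon, then size the fabric so that the projected peak still leaves the desired headroom. The sketch below assumes a constant annual growth rate, which is a simplification; the 25% rate and 100 Gbps starting point are hypothetical inputs.

```python
def required_capacity_gbps(current_peak_gbps: float, annual_growth: float,
                           years: int, headroom: float) -> float:
    """Capacity needed so the projected peak uses only (1 - headroom) of it."""
    projected_peak = current_peak_gbps * (1 + annual_growth) ** years
    return projected_peak / (1 - headroom)


# Hypothetical: 100 Gbps peak today, 25% annual growth, 3 years, 30% headroom.
capacity = required_capacity_gbps(100, annual_growth=0.25, years=3, headroom=0.30)
print(f"{capacity:.0f} Gbps")  # 279 Gbps
```

Note that headroom is applied by dividing, not by multiplying: leaving 30% of the capacity free means the projected peak may occupy at most 70% of it.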
Troubleshooting Fabric Issues
Common fabric problems:
Misconfiguration: Wrong VLANs, incorrect routing, zoning errors.
Physical Issues: Bad cables, SFPs, or connectors.
Congestion: Insufficient bandwidth for traffic load.
Routing Loops: Misconfigured routing creating loops.
Software Bugs: Firmware bugs in switches.
Systematic troubleshooting and good diagnostic tools are essential.
QoS in Fabrics
Quality of Service ensures critical traffic gets priority:
Classification: Identify traffic types (storage, VM, management, etc.).
Marking: Mark packets/frames with priority.
Queuing: Multiple queues with different priorities.
Scheduling: Algorithms for serving queues (strict priority, weighted fair queuing).
Rate Limiting: Prevent low-priority traffic from consuming all bandwidth.
For converged fabrics, QoS is essential to prevent VM traffic from impacting storage traffic.
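The Queuing and Scheduling steps above can be illustrated with weighted round-robin, a simpler relative of the weighted fair queuing mentioned in the list. In each round, the scheduler serves up to a fixed number of frames from each queue in proportion to its weight. The queue names, weights, and frame labels below are hypothetical, and the sketch counts frames rather than bytes, which real schedulers generally do not.

```python
from collections import deque


def weighted_round_robin(queues: dict[str, deque], weights: dict[str, int],
                         rounds: int) -> list[str]:
    """Serve up to weights[name] frames from each queue per round."""
    served = []
    for _ in range(rounds):
        for name, weight in weights.items():
            queue = queues[name]
            for _ in range(weight):
                if not queue:
                    break  # an empty queue gives up the rest of its turn
                served.append(queue.popleft())
    return served


queues = {
    "storage": deque(f"s{i}" for i in range(6)),
    "vm": deque(f"v{i}" for i in range(6)),
}
# Storage gets three transmission slots for every one given to VM traffic.
order = weighted_round_robin(queues, {"storage": 3, "vm": 1}, rounds=2)
print(order)  # ['s0', 's1', 's2', 'v0', 's3', 's4', 's5', 'v1']
```

Unlike strict priority, this approach never starves the lower-weight queue, which is often the behavior you want when storage and VM traffic share a converged link.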
Security Considerations
Fabric security is often overlooked:
Access Control: Restrict management access to authorized administrators.
Zoning: FC zoning prevents unauthorized storage access.
Port Security: Prevent rogue devices from connecting.
Encryption: For traffic crossing untrusted networks.
Audit Logging: Log all configuration changes for accountability.
Security should be designed in, not added later.
Migration Strategies
Migrating to a new fabric architecture is challenging:
Phased Approach: Migrate in stages, not all at once.
Dual Fabrics: Run old and new fabrics in parallel during transition.
Testing: Extensive testing before production cutover.
Rollback Plan: Clear plan for rolling back if problems occur.
Training: Ensure the team understands the new architecture before migration.
Large-scale fabric migrations are high-risk and require careful planning.
Emerging Technologies
Technologies shaping fabric evolution:
Software-Defined Networking (SDN): Separating control plane from data plane.
Overlay Networks: VXLAN and similar technologies enabling massive scale.
Programmable Fabrics: API-driven configuration and management.
White Box Switches: Commodity hardware with open network OS.
Intent-Based Networking: Declare desired outcomes and let the system handle implementation.
These technologies are still maturing but represent the future direction.
Lessons from Experience
Key lessons from building fabrics:
Simplicity: Simpler designs are more reliable than complex ones.
Consistency: Consistent configuration across fabric elements reduces errors.
Documentation: Document design decisions and configurations.
Testing: Test thoroughly before production, especially failover scenarios.
Monitoring: Comprehensive monitoring prevents surprises.
Capacity: Over-provision rather than under-provision.
Conclusion
Building scalable data center fabrics requires understanding both traditional and emerging architectures. The move from hierarchical to leaf-spine designs reflects changing traffic patterns and scale requirements.
Key principles:
- Design for horizontal scalability
- Eliminate bottlenecks and single points of failure
- Use routing instead of spanning tree
- Plan capacity with headroom
- Automate configuration and monitoring
- Test failover scenarios thoroughly
Working on FC-Redirect has deepened my appreciation for robust fabric design. Storage virtualization relies on healthy, performant fabrics. When the fabric has problems, everything has problems.
The fabric is the foundation of the data center. Invest in getting it right, and everything built on top becomes easier. Skimp on fabric design, and you’ll fight problems forever.
As data centers grow and evolve, fabric architecture becomes increasingly important. Understanding these concepts is essential for anyone designing or operating modern data center infrastructure.