VMware has become the dominant virtualization platform, and it places unique demands on storage infrastructure. Over the past year I’ve seen many VMware deployments at Cisco customers, some excellent and some problematic. Let me share the VMware storage best practices I’ve learned from both.
Understanding VMware Storage I/O
VMware’s storage I/O pattern differs from traditional workloads:
Randomization: Even if individual VMs issue sequential I/O, the streams multiplexed through the hypervisor combine into an effectively random pattern. This kills performance on traditional storage.
Density: A single server might host 30+ VMs, each with its own I/O pattern. The aggregate I/O demand is substantial.
Variability: VM workloads vary significantly. Database VMs have different characteristics than file server VMs.
Boot Storms: When many VMs boot simultaneously (after maintenance or power event), I/O spikes dramatically.
Understanding these patterns is essential for proper storage design.
Datastore Design
VMFS datastore design significantly impacts performance and manageability:
Datastore Sizing
Small Datastores (500 GB - 1 TB):
- Pros: Easier to manage, better I/O isolation between datastores
- Cons: More datastores to manage, can run out of LUNs
Large Datastores (2+ TB):
- Pros: Fewer to manage, can hold more VMs
- Cons: Less I/O isolation, SCSI locking can become an issue
My recommendation: 1-2 TB datastores as a balance. Not so large that SCSI locking becomes an issue, not so small that management becomes overwhelming.
VMs Per Datastore
Limit VMs per datastore to manage I/O contention and failure blast radius:
Production VMs: 15-20 VMs per datastore maximum
Non-Production VMs: 30-40 VMs per datastore acceptable
These are guidelines, not hard rules. Adjust based on VM I/O characteristics and storage performance.
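If you want to keep an eye on this programmatically, the VM count per datastore is easy to pull from vCenter. Here’s a minimal sketch assuming the pyVmomi library; the vCenter hostname, credentials, and the 20-VM threshold are placeholders that mirror the guideline above rather than anything VMware-defined.

```python
# Sketch: report VM count per datastore against the guideline ceiling above.
# Assumes pyVmomi is installed; host and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

PROD_LIMIT = 20  # guideline ceiling from this article, not a VMware limit

ctx = ssl._create_unverified_context()  # lab use only; validate certs in production
si = SmartConnect(host="vcenter.example.com", user="readonly", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in view.view:
        vm_count = len(ds.vm)  # VMs with files on this datastore
        flag = " <-- review" if vm_count > PROD_LIMIT else ""
        print(f"{ds.name}: {vm_count} VMs{flag}")
    view.DestroyView()
finally:
    Disconnect(si)
```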
Datastore Tiering
Separate datastores by workload characteristics:
Tier 0: High-performance VMs on all-flash or hybrid arrays with large SSD caches
Tier 1: Production VMs on fast spinning disk (15K RPM)
Tier 2: Non-production VMs on slower disk (10K or 7.2K RPM)
Tier 3: Archives and backups on capacity-optimized storage
Don’t mix tiers in the same datastore. I/O from low-tier VMs impacts high-tier VMs.
Multipathing Configuration
Proper multipathing is critical for both performance and availability:
Path Selection Policy
VMware supports several multipathing policies:
Fixed: Uses a preferred path, fails over only when that path is down. Simple but doesn’t load balance.
Most Recently Used (MRU): Uses one path until it fails, then switches to another. Common for active-passive arrays but doesn’t load balance.
Round Robin (RR): Distributes I/O across all paths. Best for performance and failure detection.
For active-active arrays (most modern arrays), use Round Robin. It provides:
- Load balancing across all paths
- Better aggregate bandwidth
- Faster failure detection
- Better utilization of multipath infrastructure
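On an ESXi host, switching shared LUNs to Round Robin can be scripted rather than clicked through. Here’s a minimal sketch that shells out to esxcli; the assumption that array LUNs show up with naa.* identifiers (and the output parsing in general) should be verified against your environment.

```python
# Sketch: set the path selection policy to Round Robin for SAN LUNs.
# Run on the ESXi host, or adapt the commands for your management tooling.
import subprocess

def list_devices():
    """Return identifiers of devices the NMP knows about (assumed naa.* LUNs)."""
    out = subprocess.check_output(
        ["esxcli", "storage", "nmp", "device", "list"], text=True)
    return [line.split()[0] for line in out.splitlines()
            if line.startswith("naa.")]  # assumption: array LUNs appear as naa.*

def set_round_robin(device):
    """Switch one device to the Round Robin path selection policy."""
    subprocess.run(
        ["esxcli", "storage", "nmp", "device", "set",
         "--device", device, "--psp", "VMW_PSP_RR"],
        check=True)

for dev in list_devices():
    set_round_robin(dev)
    print(f"{dev}: PSP set to VMW_PSP_RR")
```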
Path Failover
Configure appropriate timeout values:
Failure Detection: Lower timeout values (5-10 seconds) detect failures quickly but might cause false positives during transient congestion.
Path Down: How long to wait before declaring a path down. 15-30 seconds is typical.
Test failover scenarios to ensure timeouts are appropriate for your environment.
Queue Depth Tuning
Queue depth limits how many outstanding I/O operations are allowed:
Adapter Queue Depth
The HBA queue depth defaults to 32 or 64. For busy virtualization hosts, this may be too low:
- Modern HBAs support queue depths of 128 or higher
- Increase adapter queue depth to 128 or 256 for better throughput
- Monitor for saturation and adjust accordingly
LUN Queue Depth
Each LUN has a queue depth (default 32). With many VMs on a datastore:
- Calculate: (Number of VMs) × (Expected queue depth per VM)
- If this exceeds 32, increase LUN queue depth
- Don’t set too high—excessive queuing increases latency
Finding the right balance requires monitoring and tuning.
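The arithmetic above is simple enough to capture in a quick sizing helper. A minimal sketch; the per-VM outstanding I/O figure and the HBA ceiling are assumptions to replace with measured values.

```python
# Sketch: estimate whether the default LUN queue depth is sufficient.
DEFAULT_LUN_QDEPTH = 32
HBA_QDEPTH_CEILING = 256       # assumption: what your HBA/driver supports

def required_lun_queue_depth(vms_on_datastore, qdepth_per_vm=4):
    """(Number of VMs) x (expected queue depth per VM), per the text above."""
    return vms_on_datastore * qdepth_per_vm

needed = required_lun_queue_depth(vms_on_datastore=20, qdepth_per_vm=4)
if needed > DEFAULT_LUN_QDEPTH:
    print(f"Need ~{needed} outstanding I/Os; raise the LUN queue depth "
          f"(bounded by the HBA ceiling of {HBA_QDEPTH_CEILING}).")
else:
    print(f"Default queue depth of {DEFAULT_LUN_QDEPTH} is adequate.")
```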
VAAI Benefits
VMware vStorage APIs for Array Integration (VAAI) offloads operations to storage arrays:
Hardware Accelerated Locking: Replaces SCSI reservations with Atomic Test & Set (ATS) operations. This dramatically reduces locking overhead with many VMs on a datastore.
Block Zeroing: When provisioning thick disks, the array zeros blocks instead of the host writing zeros over the network. Much faster.
Full Copy: For cloning VMs, the array copies data internally instead of the host reading and writing the data over the network. Dramatically faster.
Thin Provisioning: Array-aware thin provisioning with space reclamation.
Enable VAAI on all datastores. The performance and efficiency benefits are substantial.
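It’s also worth confirming that your devices actually report the primitives as supported. A minimal sketch that shells out to esxcli; the output parsing is an assumption about formatting and worth checking against your ESXi build.

```python
# Sketch: report which VAAI primitives each device supports.
import subprocess

out = subprocess.check_output(
    ["esxcli", "storage", "core", "device", "vaai", "status", "get"], text=True)

device = None
for line in out.splitlines():
    if line and not line.startswith(" "):   # device id lines are unindented
        device = line.strip()
    elif "Status" in line:                  # e.g. "ATS Status: supported"
        print(f"{device}: {line.strip()}")
```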
Storage DRS
Storage DRS automatically balances VM storage across datastores:
Initial Placement: Chooses optimal datastore when deploying VMs.
Load Balancing: Monitors space and I/O load, migrates VMs to balance load.
Maintenance Mode: Evacuates VMs from datastores before maintenance.
Benefits:
- Reduced manual management
- Better utilization
- Avoids hot spots
Considerations:
- Storage vMotion creates I/O load during migration
- Set thresholds appropriately to avoid excessive migration
- Monitor storage DRS recommendations before enabling automation
Start with manual mode, observe recommendations, then enable automation once you’re comfortable.
Snapshot Management
Snapshots are essential for backups and testing but can cause problems:
Performance Impact: Snapshots create additional I/O as changes are tracked. Too many snapshots or very old snapshots degrade performance.
Space Consumption: Snapshots consume space. An active VM with snapshots can consume much more space than its base disks.
Snapshot Chains: Reads may have to traverse the chain of snapshot delta files. Long chains slow I/O.
Best practices:
- Limit snapshot retention to 24-72 hours
- Avoid snapshot chains deeper than 2-3
- Consolidate snapshots regularly
- Monitor snapshot space consumption
- Use array-based snapshots when possible for better performance
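A small audit script can flag snapshots that violate the retention and chain-depth guidance in the list above. Here’s a minimal sketch assuming pyVmomi and placeholder credentials; the 72-hour and depth-3 limits come from this article, not from a VMware default.

```python
# Sketch: flag old snapshots and deep snapshot chains.
import ssl
from datetime import datetime, timedelta, timezone
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

MAX_AGE = timedelta(hours=72)   # retention guideline from this article
MAX_DEPTH = 3                   # chain-depth guideline from this article

def walk(snapshots, depth=1):
    """Yield (snapshot, depth) for every node in the snapshot tree."""
    for snap in snapshots:
        yield snap, depth
        yield from walk(snap.childSnapshotList, depth + 1)

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="readonly", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    now = datetime.now(timezone.utc)
    for vm in view.view:
        if vm.snapshot is None:
            continue
        for snap, depth in walk(vm.snapshot.rootSnapshotList):
            age = now - snap.createTime
            if age > MAX_AGE or depth > MAX_DEPTH:
                print(f"{vm.name}: snapshot '{snap.name}' age={age} depth={depth}")
    view.DestroyView()
finally:
    Disconnect(si)
```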
Boot Storm Mitigation
When many VMs boot simultaneously, I/O spikes can overwhelm storage:
Staggered Boots: Configure VMs to boot in waves rather than all at once.
Boot from Cache: Some solutions cache boot images on flash for faster boots.
DRS Rules: Spread VMs across datastores to distribute boot load.
Flash Storage: Using flash for boot datastores largely removes boot storms as a concern.
Plan for boot storms during initial sizing. If you can’t survive a complete boot storm, you’re under-provisioned.
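A back-of-the-envelope check goes a long way here. A minimal sketch; the per-VM boot IOPS and the array’s sustainable IOPS are assumptions to replace with measured numbers.

```python
# Sketch: estimate whether the array can absorb a full boot storm,
# and how many boot waves would be needed if it cannot.
import math

vm_count = 300                 # assumption: VMs that could boot at once
boot_iops_per_vm = 60          # assumption: mostly-read I/O during guest boot
array_peak_iops = 10_000       # assumption: what the array sustains at acceptable latency

demand = vm_count * boot_iops_per_vm
if demand <= array_peak_iops:
    print(f"Full boot storm ({demand} IOPS) fits within {array_peak_iops} IOPS.")
else:
    waves = math.ceil(demand / array_peak_iops)
    print(f"Boot storm needs ~{demand} IOPS; stagger boots into {waves} waves.")
```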
Alignment and Formatting
Proper alignment prevents performance problems:
Partition Alignment: Windows VMs should have partitions aligned on 64 KB boundaries. Modern Windows versions align correctly by default, but older versions don’t.
VMDK Alignment: VMDKs created through VMware are properly aligned.
Block Size: Use default VMFS block size (1 MB) unless you need to support extremely large VMDKs.
Misalignment can reduce performance by 30% or more. Always verify alignment, especially for high-performance VMs.
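The 64 KB check itself is trivial once you have the partition’s starting offset (from diskpart or msinfo32 inside a Windows guest, for example). A minimal sketch with illustrative offsets:

```python
# Sketch: verify a partition's starting offset sits on a 64 KB boundary.
ALIGNMENT = 64 * 1024  # 64 KB, per the guidance above

def is_aligned(offset_bytes: int) -> bool:
    """True if the partition start is a multiple of 64 KB."""
    return offset_bytes % ALIGNMENT == 0

print(is_aligned(1_048_576))  # 1 MB offset (modern Windows default) -> True
print(is_aligned(32_256))     # legacy 63-sector offset -> False (misaligned)
```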
Network Design
Even though this is about storage, network design matters for IP storage:
Dedicated Networks: Use dedicated VLANs for storage traffic.
Jumbo Frames: Enable jumbo frames (MTU 9000) end to end for iSCSI and NFS. This reduces per-packet overhead and improves throughput.
Link Aggregation: Use multiple NICs with load balancing for redundancy and bandwidth.
Quality of Service: Prioritize storage traffic over other traffic types.
For iSCSI specifically:
- Use dedicated NICs for storage, not shared with VM traffic
- Enable flow control
- Disable spanning tree (or enable PortFast) on edge ports facing initiators and targets
- Use multiple paths for multipathing
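For the jumbo-frame piece specifically, the MTU has to match end to end: vmkernel port, vSwitch, physical switches, and array ports. Here’s a minimal sketch of the host side using esxcli; the vSwitch and vmkernel interface names are placeholders for your environment.

```python
# Sketch: set MTU 9000 on the iSCSI vSwitch and vmkernel port, then verify.
import subprocess

VSWITCH = "vSwitch1"   # assumption: dedicated storage vSwitch
VMK = "vmk1"           # assumption: iSCSI vmkernel interface

subprocess.run(["esxcli", "network", "vswitch", "standard", "set",
                "--vswitch-name", VSWITCH, "--mtu", "9000"], check=True)
subprocess.run(["esxcli", "network", "ip", "interface", "set",
                "--interface-name", VMK, "--mtu", "9000"], check=True)

# Verify: the interface list shows the effective MTU per vmkernel port.
print(subprocess.check_output(
    ["esxcli", "network", "ip", "interface", "list"], text=True))
```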
Monitoring and Alerting
Proactive monitoring prevents problems:
Latency: Monitor average latency and, more importantly, 95th/99th percentile latency. Spikes matter more than averages.
Queue Depth: High queue depths indicate storage or path saturation.
Path Health: Monitor all paths to ensure they’re healthy and load-balanced.
Datastore Space: Alert well before datastores fill. 80% is a good warning threshold.
Snapshot Age: Alert on old snapshots that should be consolidated.
SCSI Reservations: High SCSI reservation conflicts indicate locking issues (VAAI helps).
VMware provides metrics, but correlating with array-side metrics gives complete visibility.
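Percentiles are easy to compute once you export latency samples (from esxtop, vCenter performance charts, or the array). A minimal sketch with made-up sample data and an arbitrary 20 ms alert threshold:

```python
# Sketch: compute average, p95 and p99 latency from a list of samples (ms)
# and flag the spikes that an average alone would hide.
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

samples = [2, 3, 2, 4, 3, 2, 55, 3, 2, 60, 3, 2, 2, 3, 4, 2, 3, 2, 3, 2]  # ms, made up
avg = sum(samples) / len(samples)
p95, p99 = percentile(samples, 95), percentile(samples, 99)
print(f"avg={avg:.1f} ms  p95={p95} ms  p99={p99} ms")
if p99 > 20:  # assumption: 20 ms alert threshold
    print("p99 latency exceeds threshold -- investigate even though the average looks fine")
```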
Backup Strategy
VMware backups require special consideration:
Agentless Backups: Use VMware snapshots and Changed Block Tracking (CBT) for efficient backups without agents in VMs.
Backup Windows: With many VMs, backups can run long. Ensure adequate time and bandwidth.
Impact: Snapshots for backup impact performance. Schedule appropriately.
Array-Based Snapshots: Using array snapshots is more efficient than VMware snapshots for backup.
Backup Target: Consider backup to dedicated storage, not production arrays.
Modern backup products integrate with VMware well, but you still need to plan for scale.
Capacity Planning
VMware makes capacity planning more complex:
Over-Provisioning: With thin provisioning, allocated capacity exceeds physical capacity. Monitor carefully to avoid running out.
Growth Rate: VMs grow over time. Plan for 20-30% annual growth.
Snapshot Space: Reserve space for snapshots—typically 10-20% of datastore capacity.
Headroom: Don’t plan to 100% utilization. Leave 20-25% headroom for performance and maintenance.
Performance Capacity: You might hit performance limits before space limits. Monitor IOPS and latency, not just GB.
Better to over-provision than to scramble when you run out of capacity or performance.
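The growth and headroom figures above turn into a simple projection. A minimal sketch; the starting numbers are illustrative assumptions.

```python
# Sketch: project how long physical capacity lasts under compound growth,
# keeping the recommended headroom out of the usable pool.
physical_tb = 100.0      # assumption: installed capacity
used_tb = 55.0           # assumption: current actual consumption (thin-provisioned)
annual_growth = 0.25     # middle of the 20-30% range above
headroom = 0.20          # never plan past 80% of physical

usable_tb = physical_tb * (1 - headroom)
years = 0
while used_tb <= usable_tb and years < 10:
    used_tb *= 1 + annual_growth
    years += 1

print(f"Usable (with headroom): {usable_tb:.0f} TB")
print(f"Capacity exhausted in roughly {years} year(s) at {annual_growth:.0%} growth")
```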
Common Mistakes
Mistakes I see frequently:
Insufficient Queue Depth: Default queue depths are too low for dense virtualization.
No Multipathing: A single path to storage eliminates redundancy and limits bandwidth to one link.
Poor Datastore Design: Too many VMs per datastore, mixing workload types.
Ignoring VAAI: Not enabling VAAI costs performance and efficiency.
Alignment Issues: Especially on older VMs or P2V conversions.
Inadequate Monitoring: Not knowing you have a problem until users complain.
No Testing: Not testing failover scenarios before production.
Many of these are easy to fix if you know to look for them.
FC-Redirect and VMware
Storage virtualization like FC-Redirect provides benefits for VMware:
Non-Disruptive Migration: Migrate datastores between arrays without downtime.
Automated Tiering: Move busy VMDKs to flash storage automatically.
Thin Provisioning: Centralized thin provisioning across heterogeneous arrays.
Simplified Management: Manage storage centrally rather than per-array.
The combination of VMware and storage virtualization provides powerful flexibility for managing virtual infrastructure.
Conclusion
VMware storage is complex, with many variables affecting performance and reliability. The best practices I’ve outlined come from real-world experience—both successes and failures.
The keys to success:
- Understand VMware’s unique I/O patterns
- Design datastores appropriately
- Configure multipathing for performance and availability
- Tune queue depths for high density
- Enable VAAI
- Monitor comprehensively
- Plan capacity with headroom
- Test everything before production
Working with FC-Redirect has shown me how storage virtualization can simplify VMware storage management while improving flexibility. As virtualization density increases, these best practices become even more critical.
VMware storage doesn’t have to be complex or problematic. With proper design, configuration, and monitoring, it can be both performant and reliable. Invest the time to get it right from the start, and you’ll avoid many headaches later.