VMware has become the dominant virtualization platform, and it places unique demands on storage infrastructure. Over the past year I’ve seen many VMware deployments at Cisco customers, some excellent and some problematic. Let me share the VMware storage best practices I’ve learned from both.
Understanding VMware Storage I/O
VMware’s storage I/O pattern differs from traditional workloads:
Randomization: Even if individual VMs issue sequential I/O, the streams multiplexed through the hypervisor combine into an effectively random pattern. This kills performance on traditional storage.
Density: A single server might host 30+ VMs, each with its own I/O pattern. The aggregate I/O demand is substantial.
Variability: VM workloads vary significantly. Database VMs have different characteristics than file server VMs.
Boot Storms: When many VMs boot simultaneously (after maintenance or power event), I/O spikes dramatically.
Understanding these patterns is essential for proper storage design.
Datastore Design
VMFS datastore design significantly impacts performance and manageability:
Datastore Sizing
Small Datastores (500 GB - 1 TB):
- Pros: Easier to manage, better I/O isolation between datastores
- Cons: More datastores to manage, can run out of LUNs
Large Datastores (2+ TB):
- Pros: Fewer to manage, can hold more VMs
- Cons: Less I/O isolation, SCSI locking can become an issue
My recommendation: 1-2 TB datastores as a balance. Not so large that SCSI locking becomes an issue, not so small that management becomes overwhelming.
VMs Per Datastore
Limit VMs per datastore to manage I/O contention and failure blast radius:
Production VMs: 15-20 VMs per datastore maximum
Non-Production VMs: 30-40 VMs per datastore acceptable
These are guidelines, not hard rules. Adjust based on VM I/O characteristics and storage performance.
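If you want to keep an eye on this programmatically, the VM count per datastore is easy to pull from vCenter. Here’s a minimal sketch assuming the pyVmomi library; the vCenter hostname, credentials, and the 20-VM threshold are placeholders that mirror the guideline above rather than anything VMware-defined.

```python
# Sketch: report VM count per datastore against the guideline ceiling above.
# Assumes pyVmomi is installed; host and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

PROD_LIMIT = 20  # guideline ceiling from this article, not a VMware limit

ctx = ssl._create_unverified_context()  # lab use only; validate certs in production
si = SmartConnect(host="vcenter.example.com", user="readonly", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in view.view:
        vm_count = len(ds.vm)  # VMs with files on this datastore
        flag = " <-- review" if vm_count > PROD_LIMIT else ""
        print(f"{ds.name}: {vm_count} VMs{flag}")
    view.DestroyView()
finally:
    Disconnect(si)
```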
Datastore Tiering
Separate datastores by workload characteristics:
Tier 0: High-performance VMs on all-flash or hybrid arrays with large SSD caches
Tier 1: Production VMs on fast spinning disk (15K RPM)
Tier 2: Non-production VMs on slower disk (10K or 7.2K RPM)
Tier 3: Archives and backups on capacity-optimized storage
Don’t mix tiers in the same datastore. I/O from low-tier VMs impacts high-tier VMs.
Multipathing Configuration
Proper multipathing is critical for both performance and availability:
Path Selection Policy
VMware supports several multipathing policies:
Fixed: Uses a preferred path, fails over only when that path is down. Simple but doesn’t load balance.
Most Recently Used (MRU): Uses one path until it fails, then switches to another. Common for active-passive arrays but doesn’t load balance.
Round Robin (RR): Distributes I/O across all paths. Best for performance and failure detection.
For active-active arrays (most modern arrays), use Round Robin. It provides:
- Load balancing across all paths
- Better aggregate bandwidth
- Faster failure detection
- Better utilization of multipath infrastructure
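On an ESXi host, switching shared LUNs to Round Robin can be scripted rather than clicked through. Here’s a minimal sketch that shells out to esxcli; the assumption that array LUNs show up with naa.* identifiers (and the output parsing in general) should be verified against your environment.

```python
# Sketch: set the path selection policy to Round Robin for SAN LUNs.
# Run on the ESXi host, or adapt the commands for your management tooling.
import subprocess

def list_devices():
    """Return identifiers of devices the NMP knows about (assumed naa.* LUNs)."""
    out = subprocess.check_output(
        ["esxcli", "storage", "nmp", "device", "list"], text=True)
    return [line.split()[0] for line in out.splitlines()
            if line.startswith("naa.")]  # assumption: array LUNs appear as naa.*

def set_round_robin(device):
    """Switch one device to the Round Robin path selection policy."""
    subprocess.run(
        ["esxcli", "storage", "nmp", "device", "set",
         "--device", device, "--psp", "VMW_PSP_RR"],
        check=True)

for dev in list_devices():
    set_round_robin(dev)
    print(f"{dev}: PSP set to VMW_PSP_RR")
```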
Path Failover
Configure appropriate timeout values:
Failure Detection: Lower timeout values (5-10 seconds) detect failures quickly but might cause false positives during transient congestion.
Path Down: How long to wait before declaring a path down. 15-30 seconds is typical.
Test failover scenarios to ensure timeouts are appropriate for your environment.
Queue Depth Tuning
Queue depth limits how many outstanding I/O operations are allowed:
Adapter Queue Depth
The HBA queue depth defaults to 32 or 64. For busy virtualization hosts, this may be too low:
- Modern HBAs support queue depths of 128 or higher
- Increase adapter queue depth to 128 or 256 for better throughput
- Monitor for saturation and adjust accordingly
LUN Queue Depth
Each LUN has a queue depth (default 32). With many VMs on a datastore:
- Calculate: (Number of VMs) × (Expected queue depth per VM)
- If this exceeds 32, increase LUN queue depth
- Don’t set too high—excessive queuing increases latency
Finding the right balance requires monitoring and tuning.
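The arithmetic above is simple enough to capture in a quick sizing helper. A minimal sketch; the per-VM outstanding I/O figure and the HBA ceiling are assumptions to replace with measured values.

```python
# Sketch: estimate whether the default LUN queue depth is sufficient.
DEFAULT_LUN_QDEPTH = 32
HBA_QDEPTH_CEILING = 256       # assumption: what your HBA/driver supports

def required_lun_queue_depth(vms_on_datastore, qdepth_per_vm=4):
    """(Number of VMs) x (expected queue depth per VM), per the text above."""
    return vms_on_datastore * qdepth_per_vm

needed = required_lun_queue_depth(vms_on_datastore=20, qdepth_per_vm=4)
if needed > DEFAULT_LUN_QDEPTH:
    print(f"Need ~{needed} outstanding I/Os; raise the LUN queue depth "
          f"(bounded by the HBA ceiling of {HBA_QDEPTH_CEILING}).")
else:
    print(f"Default queue depth of {DEFAULT_LUN_QDEPTH} is adequate.")
```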
VAAI Benefits
VMware vStorage APIs for Array Integration (VAAI) offloads operations to storage arrays:
Hardware Accelerated Locking: Replaces SCSI reservations with Atomic Test & Set (ATS) operations. This dramatically reduces locking overhead with many VMs on a datastore.
Block Zeroing: When provisioning thick disks, the array zeros blocks instead of the host writing zeros over the network. Much faster.
Full Copy: For cloning VMs, the array copies data internally instead of the host reading and writing the data over the network. Dramatically faster.
Thin Provisioning: Array-aware thin provisioning with space reclamation.
Enable VAAI on all datastores. The performance and efficiency benefits are substantial.
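It’s also worth confirming that your devices actually report the primitives as supported. A minimal sketch that shells out to esxcli; the output parsing is an assumption about formatting and worth checking against your ESXi build.

```python
# Sketch: report which VAAI primitives each device supports.
import subprocess

out = subprocess.check_output(
    ["esxcli", "storage", "core", "device", "vaai", "status", "get"], text=True)

device = None
for line in out.splitlines():
    if line and not line.startswith(" "):   # device id lines are unindented
        device = line.strip()
    elif "Status" in line:                  # e.g. "ATS Status: supported"
        print(f"{device}: {line.strip()}")
```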
Storage DRS
Storage DRS automatically balances VM storage across datastores:
Initial Placement: Chooses optimal datastore when deploying VMs.
Load Balancing: Monitors space and I/O load, migrates VMs to balance load.
Maintenance Mode: Evacuates VMs from datastores before maintenance.
Benefits:
- Reduced manual management
- Better utilization
- Avoids hot spots
Considerations:
- Storage vMotion creates I/O load during migration
- Set thresholds appropriately to avoid excessive migration
- Monitor storage DRS recommendations before enabling automation
Start with manual mode, observe recommendations, then enable automation once you’re comfortable.
Snapshot Management
Snapshots are essential for backups and testing but can cause problems:
Performance Impact: Snapshots create additional I/O as changes are tracked. Too many snapshots or very old snapshots degrade performance.
Space Consumption: Snapshots consume space. An active VM with snapshots can consume much more space than its base disks.
Snapshot Chains: Reads may have to traverse the chain of snapshot delta files. Long chains slow I/O.
Best practices:
- Limit snapshot retention to 24-72 hours
- Avoid snapshot chains deeper than 2-3
- Consolidate snapshots regularly
- Monitor snapshot space consumption
- Use array-based snapshots when possible for better performance
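A small audit script can flag snapshots that violate the retention and chain-depth guidance in the list above. Here’s a minimal sketch assuming pyVmomi and placeholder credentials; the 72-hour and depth-3 limits come from this article, not from a VMware default.

```python
# Sketch: flag old snapshots and deep snapshot chains.
import ssl
from datetime import datetime, timedelta, timezone
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

MAX_AGE = timedelta(hours=72)   # retention guideline from this article
MAX_DEPTH = 3                   # chain-depth guideline from this article

def walk(snapshots, depth=1):
    """Yield (snapshot, depth) for every node in the snapshot tree."""
    for snap in snapshots:
        yield snap, depth
        yield from walk(snap.childSnapshotList, depth + 1)

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="readonly", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    now = datetime.now(timezone.utc)
    for vm in view.view:
        if vm.snapshot is None:
            continue
        for snap, depth in walk(vm.snapshot.rootSnapshotList):
            age = now - snap.createTime
            if age > MAX_AGE or depth > MAX_DEPTH:
                print(f"{vm.name}: snapshot '{snap.name}' age={age} depth={depth}")
    view.DestroyView()
finally:
    Disconnect(si)
```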
Boot Storm Mitigation
When many VMs boot simultaneously, I/O spikes can overwhelm storage:
Staggered Boots: Configure VMs to boot in waves rather than all at once.
Boot from Cache: Some solutions cache boot images on flash for faster boots.
DRS Rules: Spread VMs across datastores to distribute boot load.
Flash Storage: Using flash for boot datastores largely removes boot storms as a concern.
Plan for boot storms during initial sizing. If you can’t survive a complete boot storm, you’re under-provisioned.
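A back-of-the-envelope check goes a long way here. A minimal sketch; the per-VM boot IOPS and the array’s sustainable IOPS are assumptions to replace with measured numbers.

```python
# Sketch: estimate whether the array can absorb a full boot storm,
# and how many boot waves would be needed if it cannot.
import math

vm_count = 300                 # assumption: VMs that could boot at once
boot_iops_per_vm = 60          # assumption: mostly-read I/O during guest boot
array_peak_iops = 10_000       # assumption: what the array sustains at acceptable latency

demand = vm_count * boot_iops_per_vm
if demand <= array_peak_iops:
    print(f"Full boot storm ({demand} IOPS) fits within {array_peak_iops} IOPS.")
else:
    waves = math.ceil(demand / array_peak_iops)
    print(f"Boot storm needs ~{demand} IOPS; stagger boots into {waves} waves.")
```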
Alignment and Formatting
Proper alignment prevents performance problems:
Partition Alignment: Windows VMs should have partitions aligned on 64 KB boundaries. Modern Windows versions align correctly by default, but older versions don’t.
VMDK Alignment: VMDKs created through VMware are properly aligned.
Block Size: Use default VMFS block size (1 MB) unless you need to support extremely large VMDKs.
Misalignment can reduce performance by 30% or more. Always verify alignment, especially for high-performance VMs.
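The 64 KB check itself is trivial once you have the partition’s starting offset (from diskpart or msinfo32 inside a Windows guest, for example). A minimal sketch with illustrative offsets:

```python
# Sketch: verify a partition's starting offset sits on a 64 KB boundary.
ALIGNMENT = 64 * 1024  # 64 KB, per the guidance above

def is_aligned(offset_bytes: int) -> bool:
    """True if the partition start is a multiple of 64 KB."""
    return offset_bytes % ALIGNMENT == 0

print(is_aligned(1_048_576))  # 1 MB offset (modern Windows default) -> True
print(is_aligned(32_256))     # legacy 63-sector offset -> False (misaligned)
```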
Network Design
Even though this is about storage, network design matters for IP storage:
Dedicated Networks: Use dedicated VLANs for storage traffic.
Jumbo Frames: Enable jumbo frames (MTU 9000) end to end for iSCSI and NFS. This reduces per-packet overhead and improves throughput.
Link Aggregation: Use multiple NICs with load balancing for redundancy and bandwidth.
Quality of Service: Prioritize storage traffic over other traffic types.
For iSCSI specifically:
- Use dedicated NICs for storage, not shared with VM traffic
- Enable flow control
- Disable spanning tree (or enable PortFast) on edge ports facing initiators and targets
- Use multiple paths for multipathing
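For the jumbo-frame piece specifically, the MTU has to match end to end: vmkernel port, vSwitch, physical switches, and array ports. Here’s a minimal sketch of the host side using esxcli; the vSwitch and vmkernel interface names are placeholders for your environment.

```python
# Sketch: set MTU 9000 on the iSCSI vSwitch and vmkernel port, then verify.
import subprocess

VSWITCH = "vSwitch1"   # assumption: dedicated storage vSwitch
VMK = "vmk1"           # assumption: iSCSI vmkernel interface

subprocess.run(["esxcli", "network", "vswitch", "standard", "set",
                "--vswitch-name", VSWITCH, "--mtu", "9000"], check=True)
subprocess.run(["esxcli", "network", "ip", "interface", "set",
                "--interface-name", VMK, "--mtu", "9000"], check=True)

# Verify: the interface list shows the effective MTU per vmkernel port.
print(subprocess.check_output(
    ["esxcli", "network", "ip", "interface", "list"], text=True))
```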
Monitoring and Alerting
Proactive monitoring prevents problems:
Latency: Monitor average latency and, more importantly, 95th/99th percentile latency. Spikes matter more than averages.
Queue Depth: High queue depths indicate storage or path saturation.
Path Health: Monitor all paths to ensure they’re healthy and load-balanced.
Datastore Space: Alert well before datastores fill. 80% is a good warning threshold.
Snapshot Age: Alert on old snapshots that should be consolidated.
SCSI Reservations: High SCSI reservation conflicts indicate locking issues (VAAI helps).
VMware provides metrics, but correlating with array-side metrics gives complete visibility.
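Percentiles are easy to compute once you export latency samples (from esxtop, vCenter performance charts, or the array). A minimal sketch with made-up sample data and an arbitrary 20 ms alert threshold:

```python
# Sketch: compute average, p95 and p99 latency from a list of samples (ms)
# and flag the spikes that an average alone would hide.
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

samples = [2, 3, 2, 4, 3, 2, 55, 3, 2, 60, 3, 2, 2, 3, 4, 2, 3, 2, 3, 2]  # ms, made up
avg = sum(samples) / len(samples)
p95, p99 = percentile(samples, 95), percentile(samples, 99)
print(f"avg={avg:.1f} ms  p95={p95} ms  p99={p99} ms")
if p99 > 20:  # assumption: 20 ms alert threshold
    print("p99 latency exceeds threshold -- investigate even though the average looks fine")
```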
Backup Strategy
VMware backups require special consideration:
Agentless Backups: Use VMware snapshots and Changed Block Tracking (CBT) for efficient backups without agents in VMs.
Backup Windows: With many VMs, backups can run long. Ensure adequate time and bandwidth.
Impact: Snapshots for backup impact performance. Schedule appropriately.
Array-Based Snapshots: Using array snapshots is more efficient than VMware snapshots for backup.
Backup Target: Consider backup to dedicated storage, not production arrays.
Modern backup products integrate with VMware well, but you still need to plan for scale.
Capacity Planning
VMware makes capacity planning more complex:
Over-Provisioning: With thin provisioning, allocated capacity exceeds physical capacity. Monitor carefully to avoid running out.
Growth Rate: VMs grow over time. Plan for 20-30% annual growth.
Snapshot Space: Reserve space for snapshots—typically 10-20% of datastore capacity.
Headroom: Don’t plan to 100% utilization. Leave 20-25% headroom for performance and maintenance.
Performance Capacity: You might hit performance limits before space limits. Monitor IOPS and latency, not just GB.
Better to over-provision than to scramble when you run out of capacity or performance.
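The growth and headroom figures above turn into a simple projection. A minimal sketch; the starting numbers are illustrative assumptions.

```python
# Sketch: project how long physical capacity lasts under compound growth,
# keeping the recommended headroom out of the usable pool.
physical_tb = 100.0      # assumption: installed capacity
used_tb = 55.0           # assumption: current actual consumption (thin-provisioned)
annual_growth = 0.25     # middle of the 20-30% range above
headroom = 0.20          # never plan past 80% of physical

usable_tb = physical_tb * (1 - headroom)
years = 0
while used_tb <= usable_tb and years < 10:
    used_tb *= 1 + annual_growth
    years += 1

print(f"Usable (with headroom): {usable_tb:.0f} TB")
print(f"Capacity exhausted in roughly {years} year(s) at {annual_growth:.0%} growth")
```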
Common Mistakes
Mistakes I see frequently:
Insufficient Queue Depth: Default queue depths are too low for dense virtualization.
No Multipathing: A single path to storage eliminates redundancy and limits bandwidth to one link.
Poor Datastore Design: Too many VMs per datastore, mixing workload types.
Ignoring VAAI: Not enabling VAAI costs performance and efficiency.
Alignment Issues: Especially on older VMs or P2V conversions.
Inadequate Monitoring: Not knowing you have a problem until users complain.
No Testing: Not testing failover scenarios before production.
Many of these are easy to fix if you know to look for them.
FC-Redirect and VMware
Storage virtualization like FC-Redirect provides benefits for VMware:
Non-Disruptive Migration: Migrate datastores between arrays without downtime.
Automated Tiering: Move busy VMDKs to flash storage automatically.
Thin Provisioning: Centralized thin provisioning across heterogeneous arrays.
Simplified Management: Manage storage centrally rather than per-array.
The combination of VMware and storage virtualization provides powerful flexibility for managing virtual infrastructure.
Conclusion
VMware storage is complex, with many variables affecting performance and reliability. The best practices I’ve outlined come from real-world experience—both successes and failures.
The keys to success:
- Understand VMware’s unique I/O patterns
- Design datastores appropriately
- Configure multipathing for performance and availability
- Tune queue depths for high density
- Enable VAAI
- Monitor comprehensively
- Plan capacity with headroom
- Test everything before production
Working with FC-Redirect has shown me how storage virtualization can simplify VMware storage management while improving flexibility. As virtualization density increases, these best practices become even more critical.
VMware storage doesn’t have to be complex or problematic. With proper design, configuration, and monitoring, it can be both performant and reliable. Invest the time to get it right from the start, and you’ll avoid many headaches later.