Troubleshooting storage networks is both an art and a science. Over the past year working on FC-Redirect at Cisco, I’ve developed a systematic approach that I want to share. The key is methodical investigation combined with deep protocol knowledge.
The Troubleshooting Mindset
Effective troubleshooting starts with the right mindset:
Stay Calm: Pressure is high when storage is down, but panic leads to mistakes.
Be Systematic: Follow a methodical approach rather than random changes.
Gather Data: Collect information before drawing conclusions.
Question Assumptions: Verify everything. The “impossible” often turns out to be the problem.
Document: Keep notes of what you check and what you find.
I’ve seen too many troubleshooting sessions devolve into random changes hoping something works. That’s not troubleshooting—that’s chaos.
The Half-Split Method
When faced with a complex problem, use the half-split method:
- Divide the system in half conceptually
- Determine which half contains the problem
- Divide that half in half again
- Repeat until you isolate the problem
For example, if a host can’t see storage:
- Can the host see the fabric? (Test HBA link status)
- Can the fabric see the storage? (Check switch zoning and storage port status)
- Is the LUN masked correctly? (Verify LUN masking at array)
- Is zoning correct? (Verify zone configuration)
Each question eliminates half the potential problems, quickly narrowing the search.
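To make the idea concrete, here is a minimal sketch in Python of the host-to-storage example above as an ordered checklist: walk the path from host to array and stop at the first check that fails. The question names and results are placeholders for whatever evidence you actually gather.

```python
def isolate_fault(results):
    """Walk host -> fabric -> storage and stop at the first failing check.

    results maps each question to True (healthy) or False (failed/unknown).
    """
    order = [
        "Host sees fabric?",       # HBA link status on the host
        "Fabric sees storage?",    # storage port logged into the fabric
        "Zoning correct?",         # initiator and target share an active zone
        "LUN masking correct?",    # LUN presented to the host's WWPN
    ]
    for question in order:
        if not results.get(question, False):
            return f"Start investigating here: {question}"
    return "All checks pass; question your assumptions and re-test."

# Example: the host link is up, but the storage port never logged in.
print(isolate_fault({"Host sees fabric?": True, "Fabric sees storage?": False}))
```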
Layer-by-Layer Analysis
Storage I/O traverses many layers. Check each systematically:
Application Layer: Is the application functioning correctly? Check logs, resource usage.
File System/Volume Manager: Are file systems mounted? Any errors in logs?
Multipathing: Are all paths healthy? Check path status and failover state.
HBA Driver: Check HBA errors, firmware versions, driver versions.
Physical Layer: Verify cables, SFPs, link status, error counters.
Fabric Layer: Check switch port status, ISL health, fabric routing.
Storage Array: Verify array health, controller status, LUN presentation.
A problem can hide at any layer; checking each one systematically keeps you from missing it.
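A simple worksheet helps keep the layer-by-layer pass honest. The sketch below is purely illustrative: it lists each layer with the evidence to collect and prints what is still unchecked.

```python
# Layers ordered top to bottom, each with the evidence to collect.
WORKSHEET = {
    "application":          "logs, resource usage",
    "file system / volume": "mount status, errors in logs",
    "multipathing":         "path states, failover events",
    "HBA":                  "driver/firmware versions, error counters",
    "physical":             "cables, SFPs, link status, error counters",
    "fabric":               "port status, ISL health, routing",
    "storage array":        "controller health, LUN presentation",
}

findings = {layer: None for layer in WORKSHEET}               # None = not checked yet
findings["physical"] = "fc1/7 shows incrementing CRC errors"  # example entry

for layer, evidence in WORKSHEET.items():
    print(f"{layer:22} {findings[layer] or 'TODO: check ' + evidence}")
```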
Essential Diagnostic Commands
For Cisco MDS switches, these commands are invaluable:
show interface: Displays port status, counters, errors. Look for:
- CRC errors (bad cables/SFPs)
- Link failures (physical connectivity problems)
- Loss of signal/sync (cable or SFP issues)
- Discards (congestion or buffer credit issues)
show flogi database: Shows logged-in devices. Verify hosts and storage are logged in with expected WWPNs.
show fcns database: Fabric name server database. Shows all registered devices and their capabilities.
show zone active: Displays active zoning. Verify initiators and targets are in appropriate zones.
show fspf route: Shows fabric shortest path first routes. Useful for understanding traffic paths.
show topology: Displays fabric topology, helping identify ISL failures or unexpected paths.
These commands form the foundation of FC troubleshooting.
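When the same checklist comes up incident after incident, it can help to script the collection so every investigation starts from the same snapshot. The sketch below is only an illustration: it assumes you can run commands non-interactively over SSH, and the hostname is a placeholder for your environment.

```python
import subprocess

COMMANDS = [
    "show interface",
    "show flogi database",
    "show fcns database",
    "show zone active",
    "show fspf route",
    "show topology",
]

SWITCH = "admin@mds-switch.example.com"   # placeholder

def collect_diagnostics():
    """Run each diagnostic command over SSH and return raw output keyed by command."""
    results = {}
    for cmd in COMMANDS:
        proc = subprocess.run(["ssh", SWITCH, cmd], capture_output=True, text=True)
        results[cmd] = proc.stdout
    return results

if __name__ == "__main__":
    for cmd, output in collect_diagnostics().items():
        print(f"===== {cmd} =====")
        print(output)
```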
Interpreting Error Counters
Error counters tell stories if you know how to read them:
CRC Errors: Usually indicate bad cables, dirty connectors, or failing SFPs. Replace cables and clean connectors first.
Link Failures: Physical layer problems. Check cable seating, SFP compatibility, link distance vs. SFP type.
Loss of Signal/Sync: Physical layer issues. Verify cable integrity and SFP functionality.
Invalid Transmission Words: Often indicates incompatible or failing SFPs.
Timeouts: Can indicate fabric congestion, slow storage responses, or configuration issues.
Discards: Usually congestion or buffer credit problems. Check for oversubscription or insufficient credits.
Not all errors are equal. A few CRC errors might be harmless, but increasing errors indicate deteriorating conditions.
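One way to act on the "increasing errors" point is to compare two counter snapshots taken some interval apart and flag only the counters that are still climbing. A small sketch; the counter names and values are made up for illustration.

```python
def rising_counters(before, after, interval_s):
    """Return {counter: rate_per_hour} for counters that increased between snapshots."""
    rates = {}
    for name, new_value in after.items():
        delta = new_value - before.get(name, 0)
        if delta > 0:
            rates[name] = delta * 3600 / interval_s
    return rates

snapshot_1 = {"crc_errors": 12, "link_failures": 3, "invalid_tx_words": 40, "discards": 0}
snapshot_2 = {"crc_errors": 57, "link_failures": 3, "invalid_tx_words": 41, "discards": 0}

# Twelve stale CRC errors are background noise; 45 new ones in 15 minutes are not.
print(rising_counters(snapshot_1, snapshot_2, interval_s=900))
```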
Common Problem Patterns
Certain issues recur frequently:
The Mystery of the Disappearing Path
Symptom: Multipathing shows a path down, but everything appears normal.
Common Causes:
- Zoning removed the host or storage from a zone
- Storage port disabled or in maintenance mode
- HBA disabled or driver issue
- Cable unplugged or failed
Diagnosis: Check zone configuration, storage port status, HBA status, and physical connectivity.
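The first two checks lend themselves to a quick script once the fabric data is in hand. In the sketch below, the WWPNs and the parsed structures are placeholders; in practice they would come from the flogi database and the active zone set.

```python
def diagnose_path(host_wwpn, target_wwpn, flogi_wwpns, active_zones):
    """flogi_wwpns: set of logged-in WWPNs; active_zones: zone name -> set of member WWPNs."""
    if host_wwpn not in flogi_wwpns:
        return "Host HBA not logged in: check the HBA, driver, and cable."
    if target_wwpn not in flogi_wwpns:
        return "Storage port not logged in: check its state and cabling."
    if not any(host_wwpn in members and target_wwpn in members
               for members in active_zones.values()):
        return "No common active zone: check the zoning."
    return "Fabric-level checks pass: look at LUN masking and multipathing next."

flogi = {"10:00:00:00:c9:aa:bb:cc", "50:06:01:60:44:55:66:77"}
zones = {"zone_hostA_array1": {"10:00:00:00:c9:aa:bb:cc"}}   # target missing from the zone
print(diagnose_path("10:00:00:00:c9:aa:bb:cc", "50:06:01:60:44:55:66:77", flogi, zones))
```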
The Slow Storage Mystery
Symptom: I/O performance is poor, but nothing shows errors.
Common Causes:
- Storage array overloaded or degraded
- Fabric congestion due to insufficient ISL bandwidth
- Single path being used instead of multiple paths
- Application generating inefficient I/O patterns
- Misaligned partitions causing extra I/O
Diagnosis: Check storage array performance metrics, fabric utilization, multipathing status, and application I/O patterns.
The Intermittent Problem
Symptom: Issues occur sporadically and are hard to reproduce.
Common Causes:
- Marginal cables or SFPs that fail under load
- Environmental issues (temperature, vibration)
- Race conditions or timing-dependent bugs
- Capacity-related issues that only occur under peak load
Diagnosis: Monitor continuously to capture the problem when it occurs, and review correlations with time of day or specific events.
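For intermittent problems, a lightweight collector that timestamps counter readings lets you correlate the event with time of day after the fact. A rough sketch; read_error_counters() is a placeholder for however you actually pull counters in your environment (CLI, SNMP, and so on).

```python
import csv
import time
from datetime import datetime

def read_error_counters():
    # Placeholder: in practice, pull these from the switch (CLI, SNMP, etc.).
    return {"crc_errors": 0, "link_failures": 0, "discards": 0}

def monitor(interval_s=60, out_file="counters.csv"):
    """Append a timestamped row of counter values every interval_s seconds."""
    with open(out_file, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            counters = read_error_counters()
            writer.writerow([datetime.now().isoformat()] + list(counters.values()))
            f.flush()
            time.sleep(interval_s)
```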
Protocol Analysis
Sometimes you need to capture and analyze actual traffic:
Fabric Analyzer: MDS switches can mirror traffic to an analyzer port. Capture frames to a protocol analyzer.
What to Look For:
- Unexpected frame types or sequences
- Timeouts or retransmissions
- Abnormal command sequences
- Error responses from storage
A deep understanding of the FC protocol is essential for effective protocol analysis: you need to know what normal looks like to recognize the abnormal.
Performance Troubleshooting
Performance problems require specific approaches:
Establish Baseline: What is normal performance? Without a baseline, you can’t identify degradation.
Measure at Multiple Layers: Application, file system, multipathing, HBA, fabric, storage. Where is time being spent?
Check Utilization: CPU, memory, network, storage. What is saturated?
Analyze I/O Patterns: Block size, sequential vs. random, read vs. write, queue depth. Are patterns optimal?
Consider Caching: Is cache hit rate adequate? Poor cache effectiveness causes performance problems.
Performance troubleshooting is about finding bottlenecks, not just identifying errors.
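A quick way to see "where is time being spent" is to line up the average latency measured at each layer for the same workload and look at how much each hop adds. The numbers below are purely illustrative.

```python
# Average latency (ms) measured at each layer, top (application) to bottom (array).
layers = [
    ("application",   9.0),
    ("multipathing",  8.6),
    ("HBA",           8.4),
    ("fabric",        8.3),
    ("storage array", 2.1),   # service time reported by the array itself
]

increments = {}
for (name, total), (_, below) in zip(layers, layers[1:]):
    increments[name] = total - below       # time this layer adds on top of the one below
increments["storage array"] = layers[-1][1]

for name, added in sorted(increments.items(), key=lambda kv: -kv[1]):
    print(f"{name:15} {added:4.1f} ms")
# Here the big jump (6.2 ms) sits between the fabric measurement and the array's
# own service time, pointing at queuing or congestion rather than the disks.
```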
Using Logs Effectively
Logs are treasure troves of information:
Correlation: Correlate logs across multiple systems. A fabric event might coincide with an application error.
Timestamps: Pay attention to exact timing. The sequence of events often reveals causation.
Severity Levels: Focus on errors and critical messages first, but don’t ignore warnings.
Patterns: Look for patterns in logs. Periodic errors might indicate a scheduled task or environmental issue.
Modern log aggregation and analysis tools help tremendously with large-scale environments.
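Correlation gets much easier when you merge events from the different systems into a single timeline around the failure. A small sketch; the events and timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

# (source, timestamp, message) tuples pulled from different systems' logs.
events = [
    ("switch", "2009-03-02 14:03:11", "Interface fc1/7 link failure"),
    ("host",   "2009-03-02 14:03:12", "multipath: path 2 to LUN 5 is down"),
    ("app",    "2009-03-02 14:03:40", "transaction latency threshold exceeded"),
]

def timeline(events, around, window_s=120):
    """Return events within +/- window_s of the timestamp of interest, in time order."""
    fmt = "%Y-%m-%d %H:%M:%S"
    center = datetime.strptime(around, fmt)
    window = timedelta(seconds=window_s)
    nearby = [e for e in events
              if abs(datetime.strptime(e[1], fmt) - center) <= window]
    return sorted(nearby, key=lambda e: e[1])

for source, ts, message in timeline(events, around="2009-03-02 14:03:40"):
    print(f"{ts}  [{source:6}] {message}")
```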
Zoning Issues
Zoning problems are extremely common:
Wrong Zone: Host or storage in wrong zone, can’t communicate.
Missing Zone: Device not in any zone, isolated from fabric.
Zone Database Issues: Active zone set doesn’t match expected configuration.
VSAN Problems: Device placed in the wrong VSAN. VSANs isolate traffic, so it can't reach its intended peers.
Always verify active zoning when connectivity issues arise. Don’t assume the configuration is what you think it is.
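One way to avoid assuming the configuration is what you think it is: diff the intended zoning against what the fabric reports as active. Both dictionaries in the sketch below (zone name to member WWPNs) are placeholders you would fill from your intended configuration and from the active zone set.

```python
def zoning_drift(expected, active):
    """Compare intended zoning against the active zone set."""
    missing = set(expected) - set(active)
    unexpected = set(active) - set(expected)
    changed = {z for z in set(expected) & set(active) if expected[z] != active[z]}
    return missing, unexpected, changed

expected = {"zone_hostA_array1": {"10:00:00:00:c9:aa:bb:cc", "50:06:01:60:44:55:66:77"}}
active   = {"zone_hostA_array1": {"10:00:00:00:c9:aa:bb:cc"}}   # target was removed

missing, unexpected, changed = zoning_drift(expected, active)
print("Zones missing from the active set:", missing)
print("Zones not in the intended config:", unexpected)
print("Zones whose membership changed:", changed)
```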
Cable and Connectivity Issues
Physical layer problems are common:
Cable Types: Short-wave SFPs require multimode fiber. Long-wave SFPs require single-mode fiber. Mixing causes failures.
Distance Limitations: Each SFP type has maximum distance. Exceeding limits causes errors or complete failure.
Dirty Connectors: Fiber optic connectors accumulate dust and oils. Regular cleaning is essential.
Bend Radius: Excessive bending damages fiber. Respect minimum bend radius specifications.
Testing: Use power meters and OTDRs to verify fiber quality and measure loss.
Never underestimate the humble cable. Many sophisticated troubleshooting sessions end with “replace the cable.”
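The power-meter check is simple arithmetic: loss in dB is the launch power minus the received power, compared against the link budget for that SFP and fiber type. The numbers and the budget below are illustrative; the real figures come from the SFP data sheet.

```python
def link_loss_db(tx_dbm, rx_dbm):
    """Optical loss across the link, in dB."""
    return tx_dbm - rx_dbm

tx_dbm, rx_dbm = -3.5, -9.2           # measured at each end of the link
budget_db = 4.0                       # hypothetical allowance for this SFP/fiber type

loss = link_loss_db(tx_dbm, rx_dbm)   # 5.7 dB here
print(f"Measured loss: {loss:.1f} dB "
      f"({'over' if loss > budget_db else 'within'} the {budget_db} dB budget)")
```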
Switch-Specific Issues
Switches have their own common problems:
Firmware Bugs: Known bugs in specific firmware versions. Check release notes and bug databases.
Resource Exhaustion: Buffers, memory, or other resources exhausted. Monitor resource utilization.
Configuration Errors: Typos or logical errors in configuration.
Hardware Failures: Failed line cards, supervisors, power supplies, or fans.
Keep switches updated to stable firmware versions and monitor hardware health proactively.
Storage Array Issues
The array itself often turns out to be the problem:
Controller Overload: Controllers at max CPU or I/O capacity.
Cache Issues: Failed cache batteries, cache full, or cache disabled.
Disk Problems: Failed disks, degraded RAID groups, or insufficient spindle count.
Firmware/Software Issues: Array software bugs or misconfigurations.
Multipathing Mismatch: Array expects certain multipathing configuration but host is configured differently.
Work closely with storage vendors for array-level troubleshooting. They have detailed diagnostic tools and knowledge of their specific arrays.
Host-Side Troubleshooting
Don’t forget the host:
Driver/Firmware: Incompatible or buggy HBA drivers or firmware.
Operating System Issues: OS bugs, resource exhaustion, or misconfigurations.
Multipathing Configuration: Incorrect multipathing policy or parameters.
Application Problems: Application bugs or inefficient I/O patterns.
The host is often overlooked in storage troubleshooting, but it’s a critical component.
The Power of Baselines
I cannot overemphasize the value of baselines:
Normal Operation: Document what normal looks like: typical error rates, performance metrics, configurations.
Change Tracking: Document changes to configuration and when they occurred.
Trending: Track metrics over time to identify gradual degradation.
When troubleshooting, comparison to baseline quickly reveals what changed or deviated from normal.
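In practice the comparison can be as simple as flagging any metric that has drifted by more than some tolerance since the baseline was recorded. A sketch; the metric names, values, and the 20% tolerance are illustrative.

```python
def deviations(baseline, current, tolerance=0.20):
    """Return {metric: (baseline_value, current_value)} for metrics that drifted too far."""
    flagged = {}
    for metric, base in baseline.items():
        now = current.get(metric, 0.0)
        if base and abs(now - base) / base > tolerance:
            flagged[metric] = (base, now)
    return flagged

baseline = {"avg_read_latency_ms": 4.0, "crc_errors_per_day": 2.0, "isl_utilization_pct": 35.0}
current  = {"avg_read_latency_ms": 4.3, "crc_errors_per_day": 90.0, "isl_utilization_pct": 78.0}

for metric, (base, now) in deviations(baseline, current).items():
    print(f"{metric}: baseline {base}, now {now}")
```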
Collaboration
Complex problems often require collaboration:
Storage Team: Experts in specific arrays and storage software.
Network Team: Experts in fabric infrastructure.
Server Team: Experts in hosts and operating systems.
Application Team: Understands application behavior and requirements.
Vendor Support: Can provide specialized knowledge and tools.
Effective troubleshooting often requires bringing together experts from multiple domains.
When to Escalate
Know when to escalate to vendor support:
Hardware Failures: Actual hardware component failures need vendor RMA process.
Known Bugs: If you identify a known bug, vendor support can provide workarounds or fixes.
Complex Scenarios: Very complex problems may require vendor-specific diagnostic tools or knowledge.
Time Pressure: If you’re under severe time pressure and making no progress, escalate sooner rather than later.
There’s no shame in escalating. Vendor support engineers have deep product knowledge and access to internal resources.
Learning from Problems
Every problem is a learning opportunity:
Root Cause Analysis: Don’t stop at fixing the symptom. Understand the root cause.
Documentation: Document the problem, investigation steps, and resolution.
Process Improvement: Could the problem have been prevented? Update processes or monitoring to prevent recurrence.
Knowledge Sharing: Share learnings with team and broader community.
Organizations that learn from problems become more resilient over time.
Conclusion
Troubleshooting storage networks successfully requires systematic methodology, deep technical knowledge, good diagnostic tools, and patience.
The half-split method, layer-by-layer analysis, and careful examination of logs and error counters will solve most problems. For the rest, collaboration and vendor support are your friends.
Most importantly, stay calm and systematic. Panic and random changes make problems worse, not better.
Working on FC-Redirect has given me extensive troubleshooting experience. The complexity of storage virtualization means there are more places for problems to hide, which has made me better at systematic investigation.
Master these troubleshooting skills, and you’ll be invaluable when storage problems arise—and they always do. The difference between a good storage engineer and a great one often comes down to troubleshooting ability.