Troubleshooting storage networks is both an art and a science. Over the past year working on FC-Redirect at Cisco, I’ve developed a systematic approach that I want to share. The key is methodical investigation combined with deep protocol knowledge.
The Troubleshooting Mindset
Effective troubleshooting starts with the right mindset:
Stay Calm: Pressure is high when storage is down, but panic leads to mistakes.
Be Systematic: Follow a methodical approach rather than random changes.
Gather Data: Collect information before drawing conclusions.
Question Assumptions: Verify everything. The “impossible” often turns out to be the problem.
Document: Keep notes of what you check and what you find.
I’ve seen too many troubleshooting sessions devolve into random changes hoping something works. That’s not troubleshooting—that’s chaos.
The Half-Split Method
When faced with a complex problem, use the half-split method:
- Divide the system in half conceptually
- Determine which half contains the problem
- Divide that half in half again
- Repeat until you isolate the problem
For example, if a host can’t see storage:
- Can the host see the fabric? (Test HBA link status)
- Can the fabric see the storage? (Check switch zoning and storage port status)
- Is the LUN masked correctly? (Verify LUN masking at array)
- Is zoning correct? (Verify zone configuration)
Each question eliminates half the potential problems, quickly narrowing the search.
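To make the idea concrete, here is a minimal sketch in Python of the host-to-storage example above as an ordered checklist: walk the path from host to array and stop at the first check that fails. The question names and results are placeholders for whatever evidence you actually gather.

```python
def isolate_fault(results):
    """Walk host -> fabric -> storage and stop at the first failing check.

    results maps each question to True (healthy) or False (failed/unknown).
    """
    order = [
        "Host sees fabric?",       # HBA link status on the host
        "Fabric sees storage?",    # storage port logged into the fabric
        "Zoning correct?",         # initiator and target share an active zone
        "LUN masking correct?",    # LUN presented to the host's WWPN
    ]
    for question in order:
        if not results.get(question, False):
            return f"Start investigating here: {question}"
    return "All checks pass; question your assumptions and re-test."

# Example: the host link is up, but the storage port never logged in.
print(isolate_fault({"Host sees fabric?": True, "Fabric sees storage?": False}))
```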
Layer-by-Layer Analysis
Storage I/O traverses many layers. Check each systematically:
Application Layer: Is the application functioning correctly? Check logs, resource usage.
File System/Volume Manager: Are file systems mounted? Any errors in logs?
Multipathing: Are all paths healthy? Check path status and failover state.
HBA Driver: Check HBA errors, firmware versions, driver versions.
Physical Layer: Verify cables, SFPs, link status, error counters.
Fabric Layer: Check switch port status, ISL health, fabric routing.
Storage Array: Verify array health, controller status, LUN presentation.
A problem can hide at any layer; checking each one systematically keeps you from missing it.
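A simple worksheet helps keep the layer-by-layer pass honest. The sketch below is purely illustrative: it lists each layer with the evidence to collect and prints what is still unchecked.

```python
# Layers ordered top to bottom, each with the evidence to collect.
WORKSHEET = {
    "application":          "logs, resource usage",
    "file system / volume": "mount status, errors in logs",
    "multipathing":         "path states, failover events",
    "HBA":                  "driver/firmware versions, error counters",
    "physical":             "cables, SFPs, link status, error counters",
    "fabric":               "port status, ISL health, routing",
    "storage array":        "controller health, LUN presentation",
}

findings = {layer: None for layer in WORKSHEET}               # None = not checked yet
findings["physical"] = "fc1/7 shows incrementing CRC errors"  # example entry

for layer, evidence in WORKSHEET.items():
    print(f"{layer:22} {findings[layer] or 'TODO: check ' + evidence}")
```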
Essential Diagnostic Commands
For Cisco MDS switches, these commands are invaluable:
show interface: Displays port status, counters, errors. Look for:
- CRC errors (bad cables/SFPs)
- Link failures (physical connectivity problems)
- Loss of signal/sync (cable or SFP issues)
- Discards (congestion or buffer credit issues)
show flogi database: Shows logged-in devices. Verify hosts and storage are logged in with expected WWPNs.
show fcns database: Fabric name server database. Shows all registered devices and their capabilities.
show zone active: Displays active zoning. Verify initiators and targets are in appropriate zones.
show fspf route: Shows fabric shortest path first routes. Useful for understanding traffic paths.
show topology: Displays fabric topology, helping identify ISL failures or unexpected paths.
These commands form the foundation of FC troubleshooting.
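When the same checklist comes up incident after incident, it can help to script the collection so every investigation starts from the same snapshot. The sketch below is only an illustration: it assumes you can run commands non-interactively over SSH, and the hostname is a placeholder for your environment.

```python
import subprocess

COMMANDS = [
    "show interface",
    "show flogi database",
    "show fcns database",
    "show zone active",
    "show fspf route",
    "show topology",
]

SWITCH = "admin@mds-switch.example.com"   # placeholder

def collect_diagnostics():
    """Run each diagnostic command over SSH and return raw output keyed by command."""
    results = {}
    for cmd in COMMANDS:
        proc = subprocess.run(["ssh", SWITCH, cmd], capture_output=True, text=True)
        results[cmd] = proc.stdout
    return results

if __name__ == "__main__":
    for cmd, output in collect_diagnostics().items():
        print(f"===== {cmd} =====")
        print(output)
```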
Interpreting Error Counters
Error counters tell stories if you know how to read them:
CRC Errors: Usually indicate bad cables, dirty connectors, or failing SFPs. Replace cables and clean connectors first.
Link Failures: Physical layer problems. Check cable seating, SFP compatibility, link distance vs. SFP type.
Loss of Signal/Sync: Physical layer issues. Verify cable integrity and SFP functionality.
Invalid Transmission Words: Often indicates incompatible or failing SFPs.
Timeouts: Can indicate fabric congestion, slow storage responses, or configuration issues.
Discards: Usually congestion or buffer credit problems. Check for oversubscription or insufficient credits.
Not all errors are equal. A few CRC errors might be harmless, but increasing errors indicate deteriorating conditions.
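One way to act on the "increasing errors" point is to compare two counter snapshots taken some interval apart and flag only the counters that are still climbing. A small sketch; the counter names and values are made up for illustration.

```python
def rising_counters(before, after, interval_s):
    """Return {counter: rate_per_hour} for counters that increased between snapshots."""
    rates = {}
    for name, new_value in after.items():
        delta = new_value - before.get(name, 0)
        if delta > 0:
            rates[name] = delta * 3600 / interval_s
    return rates

snapshot_1 = {"crc_errors": 12, "link_failures": 3, "invalid_tx_words": 40, "discards": 0}
snapshot_2 = {"crc_errors": 57, "link_failures": 3, "invalid_tx_words": 41, "discards": 0}

# Twelve stale CRC errors are background noise; 45 new ones in 15 minutes are not.
print(rising_counters(snapshot_1, snapshot_2, interval_s=900))
```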
Common Problem Patterns
Certain issues recur frequently:
The Mystery of the Disappearing Path
Symptom: Multipathing shows a path down, but everything appears normal.
Common Causes:
- Zoning removed the host or storage from a zone
- Storage port disabled or in maintenance mode
- HBA disabled or driver issue
- Cable unplugged or failed
Diagnosis: Check zone configuration, storage port status, HBA status, and physical connectivity.
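The first two checks lend themselves to a quick script once the fabric data is in hand. In the sketch below, the WWPNs and the parsed structures are placeholders; in practice they would come from the flogi database and the active zone set.

```python
def diagnose_path(host_wwpn, target_wwpn, flogi_wwpns, active_zones):
    """flogi_wwpns: set of logged-in WWPNs; active_zones: zone name -> set of member WWPNs."""
    if host_wwpn not in flogi_wwpns:
        return "Host HBA not logged in: check the HBA, driver, and cable."
    if target_wwpn not in flogi_wwpns:
        return "Storage port not logged in: check its state and cabling."
    if not any(host_wwpn in members and target_wwpn in members
               for members in active_zones.values()):
        return "No common active zone: check the zoning."
    return "Fabric-level checks pass: look at LUN masking and multipathing next."

flogi = {"10:00:00:00:c9:aa:bb:cc", "50:06:01:60:44:55:66:77"}
zones = {"zone_hostA_array1": {"10:00:00:00:c9:aa:bb:cc"}}   # target missing from the zone
print(diagnose_path("10:00:00:00:c9:aa:bb:cc", "50:06:01:60:44:55:66:77", flogi, zones))
```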
The Slow Storage Mystery
Symptom: I/O performance is poor, but nothing shows errors.
Common Causes:
- Storage array overloaded or degraded
- Fabric congestion due to insufficient ISL bandwidth
- Single path being used instead of multiple paths
- Application generating inefficient I/O patterns
- Misaligned partitions causing extra I/O
Diagnosis: Check storage array performance metrics, fabric utilization, multipathing status, and application I/O patterns.
The Intermittent Problem
Symptom: Issues occur sporadically and are hard to reproduce.
Common Causes:
- Marginal cables or SFPs that fail under load
- Environmental issues (temperature, vibration)
- Race conditions or timing-dependent bugs
- Capacity-related issues that only occur under peak load
Diagnosis: Monitor continuously to capture the problem when it occurs, and review correlations with time of day or specific events.
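For intermittent problems, a lightweight collector that timestamps counter readings lets you correlate the event with time of day after the fact. A rough sketch; read_error_counters() is a placeholder for however you actually pull counters in your environment (CLI, SNMP, and so on).

```python
import csv
import time
from datetime import datetime

def read_error_counters():
    # Placeholder: in practice, pull these from the switch (CLI, SNMP, etc.).
    return {"crc_errors": 0, "link_failures": 0, "discards": 0}

def monitor(interval_s=60, out_file="counters.csv"):
    """Append a timestamped row of counter values every interval_s seconds."""
    with open(out_file, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            counters = read_error_counters()
            writer.writerow([datetime.now().isoformat()] + list(counters.values()))
            f.flush()
            time.sleep(interval_s)
```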
Protocol Analysis
Sometimes you need to capture and analyze actual traffic:
Fabric Analyzer: MDS switches can mirror traffic to an analyzer port. Capture frames to a protocol analyzer.
What to Look For:
- Unexpected frame types or sequences
- Timeouts or retransmissions
- Abnormal command sequences
- Error responses from storage
A deep understanding of the FC protocol is essential for effective protocol analysis: you need to know what normal looks like to recognize the abnormal.
Performance Troubleshooting
Performance problems require specific approaches:
Establish Baseline: What is normal performance? Without a baseline, you can’t identify degradation.
Measure at Multiple Layers: Application, file system, multipathing, HBA, fabric, storage. Where is time being spent?
Check Utilization: CPU, memory, network, storage. What is saturated?
Analyze I/O Patterns: Block size, sequential vs. random, read vs. write, queue depth. Are patterns optimal?
Consider Caching: Is cache hit rate adequate? Poor cache effectiveness causes performance problems.
Performance troubleshooting is about finding bottlenecks, not just identifying errors.
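A quick way to see "where is time being spent" is to line up the average latency measured at each layer for the same workload and look at how much each hop adds. The numbers below are purely illustrative.

```python
# Average latency (ms) measured at each layer, top (application) to bottom (array).
layers = [
    ("application",   9.0),
    ("multipathing",  8.6),
    ("HBA",           8.4),
    ("fabric",        8.3),
    ("storage array", 2.1),   # service time reported by the array itself
]

increments = {}
for (name, total), (_, below) in zip(layers, layers[1:]):
    increments[name] = total - below       # time this layer adds on top of the one below
increments["storage array"] = layers[-1][1]

for name, added in sorted(increments.items(), key=lambda kv: -kv[1]):
    print(f"{name:15} {added:4.1f} ms")
# Here the big jump (6.2 ms) sits between the fabric measurement and the array's
# own service time, pointing at queuing or congestion rather than the disks.
```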
Using Logs Effectively
Logs are treasure troves of information:
Correlation: Correlate logs across multiple systems. A fabric event might coincide with an application error.
Timestamps: Pay attention to exact timing. The sequence of events often reveals causation.
Severity Levels: Focus on errors and critical messages first, but don’t ignore warnings.
Patterns: Look for patterns in logs. Periodic errors might indicate a scheduled task or environmental issue.
Modern log aggregation and analysis tools help tremendously with large-scale environments.
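Correlation gets much easier when you merge events from the different systems into a single timeline around the failure. A small sketch; the events and timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

# (source, timestamp, message) tuples pulled from different systems' logs.
events = [
    ("switch", "2009-03-02 14:03:11", "Interface fc1/7 link failure"),
    ("host",   "2009-03-02 14:03:12", "multipath: path 2 to LUN 5 is down"),
    ("app",    "2009-03-02 14:03:40", "transaction latency threshold exceeded"),
]

def timeline(events, around, window_s=120):
    """Return events within +/- window_s of the timestamp of interest, in time order."""
    fmt = "%Y-%m-%d %H:%M:%S"
    center = datetime.strptime(around, fmt)
    window = timedelta(seconds=window_s)
    nearby = [e for e in events
              if abs(datetime.strptime(e[1], fmt) - center) <= window]
    return sorted(nearby, key=lambda e: e[1])

for source, ts, message in timeline(events, around="2009-03-02 14:03:40"):
    print(f"{ts}  [{source:6}] {message}")
```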
Zoning Issues
Zoning problems are extremely common:
Wrong Zone: Host or storage in wrong zone, can’t communicate.
Missing Zone: Device not in any zone, isolated from fabric.
Zone Database Issues: Active zone set doesn’t match expected configuration.
VSAN Problems: Device placed in the wrong VSAN. VSANs isolate traffic, so it can't reach its intended peers.
Always verify active zoning when connectivity issues arise. Don’t assume the configuration is what you think it is.
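One way to avoid assuming the configuration is what you think it is: diff the intended zoning against what the fabric reports as active. Both dictionaries in the sketch below (zone name to member WWPNs) are placeholders you would fill from your intended configuration and from the active zone set.

```python
def zoning_drift(expected, active):
    """Compare intended zoning against the active zone set."""
    missing = set(expected) - set(active)
    unexpected = set(active) - set(expected)
    changed = {z for z in set(expected) & set(active) if expected[z] != active[z]}
    return missing, unexpected, changed

expected = {"zone_hostA_array1": {"10:00:00:00:c9:aa:bb:cc", "50:06:01:60:44:55:66:77"}}
active   = {"zone_hostA_array1": {"10:00:00:00:c9:aa:bb:cc"}}   # target was removed

missing, unexpected, changed = zoning_drift(expected, active)
print("Zones missing from the active set:", missing)
print("Zones not in the intended config:", unexpected)
print("Zones whose membership changed:", changed)
```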
Cable and Connectivity Issues
Physical layer problems are common:
Cable Types: Short-wave SFPs require multimode fiber. Long-wave SFPs require single-mode fiber. Mixing causes failures.
Distance Limitations: Each SFP type has maximum distance. Exceeding limits causes errors or complete failure.
Dirty Connectors: Fiber optic connectors accumulate dust and oils. Regular cleaning is essential.
Bend Radius: Excessive bending damages fiber. Respect minimum bend radius specifications.
Testing: Use power meters and OTDRs to verify fiber quality and measure loss.
Never underestimate the humble cable. Many sophisticated troubleshooting sessions end with “replace the cable.”
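The power-meter check is simple arithmetic: loss in dB is the launch power minus the received power, compared against the link budget for that SFP and fiber type. The numbers and the budget below are illustrative; the real figures come from the SFP data sheet.

```python
def link_loss_db(tx_dbm, rx_dbm):
    """Optical loss across the link, in dB."""
    return tx_dbm - rx_dbm

tx_dbm, rx_dbm = -3.5, -9.2           # measured at each end of the link
budget_db = 4.0                       # hypothetical allowance for this SFP/fiber type

loss = link_loss_db(tx_dbm, rx_dbm)   # 5.7 dB here
print(f"Measured loss: {loss:.1f} dB "
      f"({'over' if loss > budget_db else 'within'} the {budget_db} dB budget)")
```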
Switch-Specific Issues
Switches have their own common problems:
Firmware Bugs: Known bugs in specific firmware versions. Check release notes and bug databases.
Resource Exhaustion: Buffers, memory, or other resources exhausted. Monitor resource utilization.
Configuration Errors: Typos or logical errors in configuration.
Hardware Failures: Failed line cards, supervisors, power supplies, or fans.
Keep switches updated to stable firmware versions and monitor hardware health proactively.
Storage Array Issues
The array itself often turns out to be the problem:
Controller Overload: Controllers at max CPU or I/O capacity.
Cache Issues: Failed cache batteries, cache full, or cache disabled.
Disk Problems: Failed disks, degraded RAID groups, or insufficient spindle count.
Firmware/Software Issues: Array software bugs or misconfigurations.
Multipathing Mismatch: Array expects certain multipathing configuration but host is configured differently.
Work closely with storage vendors for array-level troubleshooting. They have detailed diagnostic tools and knowledge of their specific arrays.
Host-Side Troubleshooting
Don’t forget the host:
Driver/Firmware: Incompatible or buggy HBA drivers or firmware.
Operating System Issues: OS bugs, resource exhaustion, or misconfigurations.
Multipathing Configuration: Incorrect multipathing policy or parameters.
Application Problems: Application bugs or inefficient I/O patterns.
The host is often overlooked in storage troubleshooting, but it’s a critical component.
The Power of Baselines
I cannot overemphasize the value of baselines:
Normal Operation: Document what normal looks like: typical error rates, performance metrics, configurations.
Change Tracking: Document changes to configuration and when they occurred.
Trending: Track metrics over time to identify gradual degradation.
When troubleshooting, comparison to baseline quickly reveals what changed or deviated from normal.
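In practice the comparison can be as simple as flagging any metric that has drifted by more than some tolerance since the baseline was recorded. A sketch; the metric names, values, and the 20% tolerance are illustrative.

```python
def deviations(baseline, current, tolerance=0.20):
    """Return {metric: (baseline_value, current_value)} for metrics that drifted too far."""
    flagged = {}
    for metric, base in baseline.items():
        now = current.get(metric, 0.0)
        if base and abs(now - base) / base > tolerance:
            flagged[metric] = (base, now)
    return flagged

baseline = {"avg_read_latency_ms": 4.0, "crc_errors_per_day": 2.0, "isl_utilization_pct": 35.0}
current  = {"avg_read_latency_ms": 4.3, "crc_errors_per_day": 90.0, "isl_utilization_pct": 78.0}

for metric, (base, now) in deviations(baseline, current).items():
    print(f"{metric}: baseline {base}, now {now}")
```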
Collaboration
Complex problems often require collaboration:
Storage Team: Experts in specific arrays and storage software.
Network Team: Experts in fabric infrastructure.
Server Team: Experts in hosts and operating systems.
Application Team: Understands application behavior and requirements.
Vendor Support: Can provide specialized knowledge and tools.
Effective troubleshooting often requires bringing together experts from multiple domains.
When to Escalate
Know when to escalate to vendor support:
Hardware Failures: Actual hardware component failures need vendor RMA process.
Known Bugs: If you identify a known bug, vendor support can provide workarounds or fixes.
Complex Scenarios: Very complex problems may require vendor-specific diagnostic tools or knowledge.
Time Pressure: If you’re under severe time pressure and making no progress, escalate sooner rather than later.
There’s no shame in escalating. Vendor support engineers have deep product knowledge and access to internal resources.
Learning from Problems
Every problem is a learning opportunity:
Root Cause Analysis: Don’t stop at fixing the symptom. Understand the root cause.
Documentation: Document the problem, investigation steps, and resolution.
Process Improvement: Could the problem have been prevented? Update processes or monitoring to prevent recurrence.
Knowledge Sharing: Share learnings with team and broader community.
Organizations that learn from problems become more resilient over time.
Conclusion
Troubleshooting storage networks successfully requires systematic methodology, deep technical knowledge, good diagnostic tools, and patience.
The half-split method, layer-by-layer analysis, and careful examination of logs and error counters will solve most problems. For the rest, collaboration and vendor support are your friends.
Most importantly, stay calm and systematic. Panic and random changes make problems worse, not better.
Working on FC-Redirect has given me extensive troubleshooting experience. The complexity of storage virtualization means there are more places for problems to hide, which has made me better at systematic investigation.
Master these troubleshooting skills, and you’ll be invaluable when storage problems arise—and they always do. The difference between a good storage engineer and a great one often comes down to troubleshooting ability.