High availability is not optional for enterprise storage; it's a fundamental requirement. Applications expect storage to be always available, and achieving this requires careful architecture. Working on FC-Redirect has taught me a lot about designing for HA. Let me share key patterns and practices.
Defining High Availability
First, what does "high availability" mean? Common metrics include:
Uptime Percentage: 99.9% (three nines) means ~8.76 hours of downtime per year. 99.99% (four nines) means ~52.6 minutes per year. 99.999% (five nines) means ~5.26 minutes per year.
RTO (Recovery Time Objective): How long can you be down? Minutes? Seconds?
RPO (Recovery Point Objective): How much data can you afford to lose? None? Minutes worth?
Understanding your requirements is essential: achieving five nines is vastly more expensive than three nines, so match the architecture to what each workload actually needs. The quick calculation below shows the downtime budget each target implies.
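As a rough back-of-envelope sketch (using a 365.25-day year, so the figures line up with the numbers above):

    # Rough downtime budget implied by each availability target.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    def downtime_budget_minutes(availability):
        """Minutes of allowed downtime per year at a given availability."""
        return MINUTES_PER_YEAR * (1.0 - availability)

    for label, availability in [("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("five nines", 0.99999)]:
        print(f"{label}: {downtime_budget_minutes(availability):.1f} minutes/year")

Five nines leaves barely five minutes a year, which is why it usually means redundancy at every layer rather than just faster repair.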
Redundancy Principles
HA is fundamentally about eliminating single points of failure through redundancy:
N+1 Redundancy: One extra component beyond minimum required. If you need 2 power supplies, install 3.
N+N Redundancy: Full duplication. Two complete systems, either of which can handle the full load.
2N+1 Redundancy: Two complete systems plus one extra. Handles multiple simultaneous failures.
The right level depends on availability requirements and acceptable cost.
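To see why the redundancy level matters, here is a small probability sketch, assuming component failures are independent (shared dependencies, covered under the pitfalls later, break that assumption):

    from math import comb

    def at_least_k_of_n(a, n, k):
        """Probability that at least k of n independent components are up,
        given per-component availability a."""
        return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

    a = 0.99  # illustrative per-component availability
    print(at_least_k_of_n(a, n=2, k=2))  # no redundancy, need both: ~0.980
    print(at_least_k_of_n(a, n=3, k=2))  # N+1, need 2 of 3:        ~0.9997
    print(at_least_k_of_n(a, n=2, k=1))  # N+N, either of 2:         0.9999

The numbers only hold if the components can actually fail independently, which is exactly what dual power circuits, dual fabrics, and separate cabling are meant to guarantee.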
Storage Array HA
Modern storage arrays are designed for HA:
Redundant Controllers: Active-active or active-passive controller configurations. If one controller fails, the other takes over.
Redundant Power Supplies: Multiple power supplies, each capable of powering the entire array.
Redundant Fans: Multiple cooling fans with enough redundancy that the array stays cool with some fans failed.
RAID: Protects against disk failures. RAID 6 survives two concurrent disk failures.
Hot Spares: Pre-installed spare disks that automatically replace failed disks.
Battery Backup: Protected write cache survives power failures without data loss.
High-end arrays have no single point of failure. Every component is redundant.
Fabric Redundancy
The FC fabric must also be redundant:
Dual Fabrics: Completely separate FC fabrics. If one fabric fails entirely, the other continues operating.
Redundant Switches: Multiple switches per fabric. If one switch fails, others handle traffic.
Redundant ISLs: Multiple inter-switch links. If one ISL fails, others carry the traffic.
Redundant Directors: Director-class switches with redundant supervisors, power supplies, and fabric modules.
Best practice is to have two completely independent fabrics: different switches, different cables, different power circuits. This survives even major failures.
Host Connectivity
Hosts need redundant connections to storage:
Multiple HBAs: At least two HBAs per host, each connected to a different fabric.
Multipathing: Software that uses multiple paths and fails over automatically when paths fail.
Load Balancing: Distribute I/O across all available paths for better performance and faster failure detection.
With proper multipathing configuration, path failures are transparent to applications: a brief I/O pause of milliseconds, not an outage.
Multipathing Strategies
Multipathing software implements various strategies:
Active-Active: All paths are used simultaneously. If any path fails, remaining paths carry the load.
Active-Passive: One path is preferred, others are standby. Failover happens when the active path fails.
Round-Robin: I/O alternates across all paths. Good for load balancing and failure detection.
Least Queue Depth: Route I/O to the path with fewest outstanding commands.
Active-active with round-robin provides the best combination of performance and HA in most scenarios.
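As a toy illustration of the round-robin and least-queue-depth policies above (not any particular multipathing driver's implementation), a path selector might look like this:

    import itertools
    from dataclasses import dataclass

    @dataclass
    class Path:
        name: str
        healthy: bool = True
        outstanding: int = 0   # commands currently in flight on this path

    class PathSelector:
        """Toy path selection; real multipathing software adds error handling,
        path grouping, and periodic health probing of standby paths."""

        def __init__(self, paths):
            self.paths = paths
            self._turn = itertools.count()

        def _healthy(self):
            live = [p for p in self.paths if p.healthy]
            if not live:
                raise IOError("all paths down")
            return live

        def round_robin(self):
            live = self._healthy()
            return live[next(self._turn) % len(live)]

        def least_queue_depth(self):
            return min(self._healthy(), key=lambda p: p.outstanding)

Because selection only ever considers healthy paths, a failed path simply drops out of rotation, which is what makes path failure invisible to the application above.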
Failure Detection
Quick failure detection is critical for HA. The faster you detect failures, the faster you can respond:
Link Failures: FC detects link failures via loss of signal or loss of sync. Detection is typically sub-second.
Path Failures: Timeouts on I/O operations indicate path failures. Tuning timeout values balances quick detection against false positives.
Array Failures: Health monitoring detects array component failures. Arrays send alerts and may proactively fail over.
Switch Failures: Fabric monitoring detects switch failures. Automatic rerouting occurs within seconds.
In FC-Redirect, we detect switch failures quickly through heartbeat mechanisms and can fail over virtualization state to other switches.
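A minimal heartbeat-based detector looks something like the sketch below; this illustrates the general idea, not the actual FC-Redirect implementation:

    import time

    class HeartbeatMonitor:
        """Declare a peer failed if no heartbeat arrives within `timeout` seconds.
        Real detectors usually require several missed intervals to avoid
        false positives from transient congestion."""

        def __init__(self, timeout=2.0):
            self.timeout = timeout
            self.last_seen = {}

        def heartbeat(self, peer):
            self.last_seen[peer] = time.monotonic()

        def failed_peers(self):
            now = time.monotonic()
            return [p for p, t in self.last_seen.items() if now - t > self.timeout]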
Stateful Failover
For true HA, failover must preserve state. In FC-Redirect, this means:
Metadata Replication: Virtual LUN configurations and mappings are replicated across multiple switches.
Synchronous Updates: State changes are synchronously replicated before acknowledging to hosts.
Quorum: A majority of switches must agree before making state changes. This prevents split-brain scenarios.
Fast Failover: When a switch fails, another switch takes over its responsibilities within seconds.
This stateful failover enables non-disruptive failures. Hosts see a brief I/O pause but no lost data or hung operations.
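A simplified model of that flow, with a hypothetical peer.apply() call standing in for whatever replication transport is actually used:

    class ReplicatedMetadata:
        """Sketch of synchronous, quorum-acknowledged state changes."""

        def __init__(self, peers):
            self.peers = peers      # other switches holding replicas
            self.state = {}

        def update(self, key, value):
            self.state[key] = value
            acks = 1                # count the local copy
            for peer in self.peers:
                if peer.apply((key, value)):   # hypothetical: True once durable
                    acks += 1
            # Acknowledge to the host only once a majority holds the change.
            total = len(self.peers) + 1
            return acks * 2 > total

If a majority cannot be reached, the change is never acknowledged, which is exactly the property that prevents a recovering switch from discovering acknowledged-but-lost state.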
Split-Brain Avoidance
Split-brain is a scenario where parts of the system become isolated and diverge:
The Problem: Two components both think they're active and make conflicting changes. When connectivity is restored, the state is inconsistent.
Fencing: Explicitly prevent failed components from accessing shared resources. SCSI reservations can fence failed nodes.
Quorum: Require majority agreement for state changes. A minority partition cannot make changes.
Witness: A third-party arbitrator decides which partition is authoritative during network partitions.
Split-brain can cause data corruption, so it must be prevented architecturally.
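The quorum and witness rules boil down to a single check each partition must make before continuing to serve writes; a minimal sketch:

    def has_quorum(votes_held, total_votes):
        """A partition may keep serving only with a strict majority of votes.
        `total_votes` counts every cluster member plus any witness."""
        return votes_held * 2 > total_votes

    # Two controllers plus a witness: three votes in total.
    print(has_quorum(votes_held=2, total_votes=3))  # True: node + witness carry on
    print(has_quorum(votes_held=1, total_votes=3))  # False: isolated node fences itself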
Testing Failure Scenarios
HA designs must be tested. Common tests:
Component Failures: Unplug cables, power off switches, remove HBAs. Verify failover works.
Concurrent Failures: Fail multiple components simultaneously. Does the system survive?
Failure During Load: Failures under heavy I/O load are harder to handle than idle failures.
Incomplete Failures: Sometimes components partially fail, responding to some operations but not others. These can be harder to detect and handle than complete failures.
I cannot overemphasize the importance of testing. Many HA designs work in theory but fail in practice due to subtle bugs or configuration errors.
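A failover drill can be as simple as the sketch below; fail_path, run_io, and restore_path are placeholders for environment-specific tooling (disabling a switch port, driving load, re-enabling the port), and the two-second budget is purely illustrative:

    import random

    def failover_drill(paths, fail_path, run_io, restore_path, max_pause_s=2.0):
        """Fail one path under load and check that I/O only pauses briefly."""
        victim = random.choice(paths)
        fail_path(victim)                      # e.g. administratively down a port
        longest_stall = run_io(duration_s=30)  # returns the worst observed I/O stall
        restore_path(victim)
        assert longest_stall <= max_pause_s, (
            f"I/O stalled {longest_stall:.1f}s after losing {victim}")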
Planned Maintenance
HA isn't just about unplanned failures; it's also about surviving planned maintenance:
Rolling Upgrades: Upgrade components one at a time while others remain operational.
Hitless Failover: Manually fail over to redundant components to service primary components.
Online Expansion: Add capacity or components without downtime.
Good HA design enables maintenance without downtime. This is often more important than surviving unexpected failures.
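In code form, a rolling upgrade is just a loop with a health gate between steps; drain, upgrade, and health_check stand in for environment-specific operations:

    def rolling_upgrade(nodes, drain, upgrade, health_check):
        """Upgrade one node at a time, stopping the moment redundancy is at risk."""
        for node in nodes:
            drain(node)                  # move workload onto the redundant peers
            upgrade(node)
            if not health_check(node):   # never touch the next node while degraded
                raise RuntimeError(f"{node} unhealthy after upgrade; halting rollout")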
Monitoring and Alerting
You can't fix what you don't know about. Comprehensive monitoring is essential:
Component Health: Monitor all components for failures or degradation.
Path Health: Verify all paths are operational and performing well.
Capacity: Track capacity usage and project when you'll run out.
Performance: Monitor IOPS, throughput, latency. Detect degradation before it becomes severe.
Predictive Failures: Modern arrays predict disk failures before they occur. Act on these warnings.
Alerts should be actionable. Alert fatigue from too many false positives leads to ignored alerts and missed real problems.
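One simple way to keep alerts actionable is to require a sustained breach before paging anyone; a sketch, with illustrative thresholds:

    from collections import deque

    class LatencyAlert:
        """Fire only when latency stays above the threshold for a full window."""

        def __init__(self, threshold_ms=20.0, window=12):
            self.threshold_ms = threshold_ms
            self.samples = deque(maxlen=window)

        def record(self, latency_ms):
            """Return True when every sample in the window exceeds the threshold."""
            self.samples.append(latency_ms)
            return (len(self.samples) == self.samples.maxlen and
                    all(s > self.threshold_ms for s in self.samples))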
Change Management
Paradoxically, many outages are self-inflicted during changes. Change management is HA-critical:
Change Windows: Schedule changes during low-usage periods when impact is minimized.
Testing: Test changes in non-production environments first.
Rollback Plans: Every change needs a documented rollback procedure.
Peer Review: Have changes reviewed by another engineer before execution.
Incremental Changes: Make small changes rather than large ones. Smaller changes have less risk and are easier to troubleshoot.
Most outages I've seen were caused by well-intentioned changes gone wrong, not hardware failures.
Documentation
During outages, you don't want to be searching for documentation:
Runbooks: Step-by-step procedures for common scenarios.
Configuration Documentation: As-built documentation of current configuration.
Topology Diagrams: Clear diagrams showing how everything connects.
Contact Lists: Who to call for different systems and vendors.
Troubleshooting Guides: Flowcharts for diagnosing common problems.
Keep documentation current. Outdated documentation is worse than no documentation: it misleads rather than helps.
Disaster Recovery
HA handles component failures, but disaster recovery handles site failures:
Replication: Replicate data to another site. Synchronous replication provides RPO=0 but requires low latency. Asynchronous replication tolerates distance but has non-zero RPO.
Regular Testing: Test failover to DR site regularly. Untested DR plans don't work when you need them.
Runbooks: Document exactly how to activate DR site.
Data Consistency: Ensure DR data is consistent, especially for multi-tier applications with dependencies.
DR is expensive and complex, but for critical systems, it's necessary.
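The RPO difference between synchronous and asynchronous replication comes down to when the host gets its acknowledgment; a sketch using plain lists as stand-ins for the two sites:

    class SyncReplica:
        """RPO = 0: acknowledge only after the remote copy is durable."""

        def __init__(self, local, remote):
            self.local, self.remote = local, remote

        def write(self, block):
            self.local.append(block)
            self.remote.append(block)   # host waits on the inter-site round trip
            return "ack"

    class AsyncReplica:
        """Non-zero RPO: acknowledge locally, ship to the remote site later."""

        def __init__(self, local, remote):
            self.local, self.remote = local, remote
            self.pending = []

        def write(self, block):
            self.local.append(block)
            self.pending.append(block)  # anything still queued here is the RPO exposure
            return "ack"

        def drain(self):
            while self.pending:
                self.remote.append(self.pending.pop(0))

Synchronous mode pays the inter-site latency on every write; asynchronous mode hides it but accepts that whatever is still pending when the site is lost is gone.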
Cost vs. Availability Trade-offs
Higher availability costs more:
Additional Hardware: Redundant components cost money.
Complexity: More complex architectures are harder to manage and troubleshoot.
Testing Overhead: Proper testing requires time and resources.
Operational Procedures: Maintaining HA requires careful operational discipline.
Not every system needs five nines. Architect appropriate availability for each workload's requirements.
Common Pitfalls
HA mistakes I've seen:
Shared Dependencies: "Redundant" components that share a power circuit or network switch aren't truly independent.
Untested Failover: Failover that works in theory but has never been tested often doesn't work in practice.
Configuration Drift: Redundant components that should be identical but have drifted to different configurations.
Tight Coupling: Systems so tightly coupled that a failure in one cascades to others.
Human Error: Not accounting for human mistakes, which cause more outages than hardware failures.
Real-World HA
In practice, achieving high availability requires:
Architecture: Eliminate single points of failure through redundancy.
Implementation: Configure and deploy correctly. Details matter enormously.
Testing: Validate that failover actually works.
Operations: Maintain configurations, monitor health, respond to alerts.
Culture: Organization-wide commitment to availability.
HA is not just a technology problem; it's an organizational discipline.
Conclusion
High availability is achievable but not automatic. It requires careful architecture, correct implementation, thorough testing, and disciplined operations.
Working on FC-Redirect has reinforced how important stateful failover is for true HA. Being able to fail over virtualization state quickly and transparently enables non-disruptive failures, the holy grail of HA.
As storage becomes more critical to business operations, HA requirements only increase. Understanding HA principles and patterns is essential for anyone designing or operating storage infrastructure.
The cost of downtime, in lost revenue, productivity, and reputation, far exceeds the cost of implementing proper HA. Invest in availability upfront, test it thoroughly, and maintain it diligently. Your future self (and your users) will thank you.