Disaster recovery is insurance you hope never to use. But when disaster strikes, DR capabilities determine whether you survive or fail. Over the past year at Cisco, I’ve seen both excellent and inadequate DR implementations. Let me share what makes storage DR effective.
Understanding DR Requirements
Start by defining requirements:
Recovery Point Objective (RPO): How much data can you afford to lose? Minutes? Hours? Days?
Recovery Time Objective (RTO): How long can you be down? Minutes? Hours? Days?
Scope: Which systems require DR? Not everything needs the same protection.
Testing: How often will you test? Untested DR doesn’t work when you need it.
These requirements drive architecture and budget decisions.
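To make scope concrete, it helps to capture these decisions per application. Below is a minimal sketch in Python of the kind of requirements matrix that can live alongside the DR plan; the application names, tiers, and numbers are purely illustrative, not from any particular tool.

```python
from dataclasses import dataclass

@dataclass
class DrRequirement:
    """Illustrative per-application DR requirement (hypothetical values)."""
    application: str
    rpo_minutes: int      # maximum tolerable data loss
    rto_minutes: int      # maximum tolerable downtime
    in_scope: bool        # does this system get DR protection at all?
    test_frequency: str

# Hypothetical tiering -- the point is that not everything gets the same protection.
requirements = [
    DrRequirement("order-database", rpo_minutes=0,    rto_minutes=60,   in_scope=True,  test_frequency="quarterly"),
    DrRequirement("file-shares",    rpo_minutes=240,  rto_minutes=480,  in_scope=True,  test_frequency="semi-annual"),
    DrRequirement("build-servers",  rpo_minutes=1440, rto_minutes=2880, in_scope=False, test_frequency="n/a"),
]

for r in requirements:
    print(f"{r.application}: RPO {r.rpo_minutes} min, RTO {r.rto_minutes} min, in scope: {r.in_scope}")
```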
DR vs. Backup vs. High Availability
Distinguish between related but different concepts:
High Availability: Survive component failures within a data center. Automated failover, measured in seconds or minutes.
Disaster Recovery: Survive site-wide failures. Failover to different site, measured in hours or days.
Backup: Point-in-time copies for recovering from data corruption or accidental deletion, and for meeting retention requirements.
You need all three, but they solve different problems with different technologies.
Replication Technologies
Replication is the foundation of most DR strategies:
Synchronous Replication
How It Works: Writes are sent to both local and remote storage. Acknowledged only when both confirm.
RPO: Zero data loss (RPO=0).
RTO: Depends on application failover time; can be minutes.
Constraints: Requires low latency between sites (typically <5ms), distance limited to ~100km depending on application tolerance.
Use Cases: Mission-critical applications where no data loss is acceptable.
Synchronous replication is the gold standard but has distance and cost limitations.
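To see where the distance limit comes from, consider propagation delay alone. Light in fiber covers roughly 200 km per millisecond, and a synchronous write needs at least one round trip to the remote array before it can be acknowledged. A quick back-of-the-envelope estimate (ignoring switch, protocol, and array overhead, which only make things worse):

```python
# Lower-bound estimate of the write latency penalty synchronous replication adds
# purely from fiber propagation delay. Real deployments add protocol and array
# overhead on top of this.

SPEED_IN_FIBER_KM_PER_MS = 200.0  # light travels ~200 km per millisecond in fiber

def sync_write_penalty_ms(distance_km: float, round_trips_per_write: int = 1) -> float:
    """Added latency per write from propagation delay alone."""
    round_trip_ms = 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS
    return round_trips_per_write * round_trip_ms

for km in (10, 50, 100, 200):
    print(f"{km:>4} km: +{sync_write_penalty_ms(km):.2f} ms per write (propagation only)")
```

At 100 km that is about 1 ms of added latency per write before any overhead, which is why guidance like "<5 ms, ~100 km" shows up for latency-sensitive applications.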
Asynchronous Replication
How It Works: Writes are acknowledged locally, then replicated to remote site with some delay.
RPO: Non-zero, typically minutes to hours depending on configuration.
RTO: Depends on application failover, typically hours.
Constraints: No practical distance limit (works over WAN), but needs enough bandwidth to keep replication lag within the RPO.
Use Cases: Applications that can tolerate some data loss, long-distance DR.
Asynchronous replication is more flexible but accepts some data loss.
Snapshot-Based Replication
How It Works: Periodic snapshots are replicated to DR site.
RPO: Snapshot interval (typically hours).
RTO: Hours to days depending on recovery process.
Benefits: Lower overhead, simpler than continuous replication.
Use Cases: Applications with looser RPO requirements.
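One nuance worth remembering: the effective worst-case RPO is the snapshot interval plus the time it takes to ship the snapshot delta to the DR site. A rough estimate, with entirely illustrative numbers:

```python
# Worst-case RPO sketch for snapshot-based replication: data written just after a
# snapshot is not protected until the *next* snapshot is taken and fully shipped.

def worst_case_rpo_hours(snapshot_interval_h: float,
                         delta_gb: float,
                         link_mbps: float) -> float:
    """Snapshot interval plus the time to transfer the snapshot delta."""
    transfer_hours = (delta_gb * 8 * 1024) / link_mbps / 3600  # GB -> megabits -> hours
    return snapshot_interval_h + transfer_hours

# Example: 4-hour snapshots, 200 GB changed per interval, 155 Mbps (OC-3) link.
print(f"Worst-case RPO: ~{worst_case_rpo_hours(4, 200, 155):.1f} hours")
```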
Array-Based vs. Host-Based Replication
Two approaches to implementing replication:
Array-Based Replication
Advantages:
- Offloads work from hosts and applications
- Transparent to applications
- Leverages array intelligence (only replicate changed blocks)
- Can replicate many LUNs efficiently
Disadvantages:
- Usually requires same storage vendor at both sites
- Expensive licensing
- Tied to array capabilities
Most enterprise environments use array-based replication.
Host-Based Replication
Advantages:
- Works with any storage
- Application-aware (can ensure consistency)
- More flexible
Disadvantages:
- Consumes host resources
- May impact application performance
- More complex configuration
Common for databases (Oracle Data Guard, SQL Server mirroring) where application-level consistency matters.
Network-Based Replication
Fabric-based storage virtualization, such as Cisco’s FC-Redirect, enables network-based replication:
Approach: Virtualization layer replicates data to DR site.
Benefits:
- Works with heterogeneous arrays
- Centralized management
- Can replicate across different storage types
Challenges:
- Requires virtualization infrastructure at both sites
- Adds complexity
Network-based replication provides flexibility in heterogeneous environments.
Consistency Groups
For applications spanning multiple LUNs, consistency matters:
The Problem: If LUNs are replicated independently, the remote copies may not be consistent with each other.
Solution: Consistency groups ensure a set of LUNs are replicated as a unit.
Example: Database with data files on one LUN and logs on another. Both must be consistent at the DR site.
Always use consistency groups for multi-LUN applications.
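The sketch below is purely conceptual (the snapshot calls are stand-ins, not a real array API), but it shows the difference between replicating LUNs independently and replicating them as a coordinated group:

```python
# Hypothetical illustration of why consistency groups matter: replicating LUNs
# as a unit means the DR copies all reflect the same point in time.

import time

class Lun:
    def __init__(self, name):
        self.name = name

    def snapshot(self):
        # Stand-in for an array snapshot; returns the point in time it captured.
        return (self.name, time.time())

def replicate_independently(luns):
    """Each LUN is captured at a slightly different moment -- an inconsistent set."""
    return [lun.snapshot() for lun in luns]

def replicate_as_consistency_group(luns):
    """All LUNs are captured against one coordinated point in time."""
    point_in_time = time.time()  # real arrays briefly fence I/O to achieve this
    return [(lun.name, point_in_time) for lun in luns]

db = [Lun("oracle-data"), Lun("oracle-logs")]
print(replicate_independently(db))         # data and log timestamps differ
print(replicate_as_consistency_group(db))  # identical point in time for both
```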
DR Site Options
Several models for DR sites:
Active-Passive
Setup: Production runs at primary site. DR site is standby, used only during disasters.
Pros: Simpler, lower cost.
Cons: DR resources sit idle, must fail over during disaster.
Typical RTO: Hours to days.
Active-Active
Setup: Both sites run production workloads. Each can take over for the other.
Pros: All resources utilized, faster failover, can be automated.
Cons: More complex, more expensive, requires load balancing.
Typical RTO: Minutes to hours.
Pilot Light
Setup: Core infrastructure runs at DR site but with minimal capacity. Scale up during disaster.
Pros: Lower cost than full active-passive, faster than cold standby.
Cons: Still requires scale-up time during disaster.
Typical RTO: Hours.
Choose based on RTO requirements and budget.
Distance Considerations
Geography impacts DR design:
Metro Distance (<50 km): Enables synchronous replication with very low added latency.
Extended Metro Distance (50-100 km): Marginal for synchronous replication, depending on the application’s latency tolerance.
Long Distance (>100 km): Requires asynchronous replication.
Geographic Diversity: Balance low latency (closer sites) with disaster protection (farther sites).
To survive a regional disaster that could affect both sites, farther is better. For performance, closer is better.
Bandwidth Planning
DR replication consumes bandwidth:
Initial Sync: Full copy of data to DR site. This can take days or weeks for large datasets.
Ongoing Replication: Bandwidth depends on change rate. Monitor actual change rates and provision accordingly.
Burst Capacity: Plan for spikes (backups, batch jobs, etc.).
Compression: Replication compression can reduce bandwidth needs by 2-5x.
Under-provisioned bandwidth causes replication lag, increasing RPO.
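A rough sizing exercise makes the point. The numbers below (dataset size, change rate, link speed, compression ratio, burst headroom) are illustrative; plug in your own measurements:

```python
# Rough replication bandwidth sizing sketch (all numbers illustrative).

def initial_sync_days(dataset_tb: float, link_mbps: float, compression: float = 2.0) -> float:
    """Days to ship the initial full copy over the WAN."""
    megabits = dataset_tb * 8 * 1024 * 1024 / compression
    return megabits / link_mbps / 86400

def steady_state_mbps(change_gb_per_day: float, compression: float = 2.0,
                      burst_headroom: float = 1.5) -> float:
    """Average link bandwidth needed to keep up with the daily change rate."""
    megabits_per_day = change_gb_per_day * 8 * 1024 / compression
    return megabits_per_day / 86400 * burst_headroom

# Example: 20 TB dataset, 500 GB/day change rate, 155 Mbps link, 2x compression.
print(f"Initial sync: ~{initial_sync_days(20, 155):.1f} days")
print(f"Steady-state need: ~{steady_state_mbps(500):.0f} Mbps (with 1.5x burst headroom)")
```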
Failover and Failback
Plan both directions:
Failover (Production to DR)
Triggering: Who decides to fail over? What are the criteria?
Process: Documented step-by-step procedures.
Testing: Regular testing to ensure procedures work.
Communication: Notify stakeholders about failover.
Verification: Verify DR site is functioning before declaring success.
Failback (DR to Production)
Timing: When to fail back? Not during crisis—wait until primary site is fully recovered.
Reverse Replication: Replicate changes from DR site back to primary.
Verification: Verify primary site before failing back.
Minimize Disruption: Plan failback during maintenance window.
Failback is often overlooked but equally important.
Application-Level Considerations
Different applications have different DR requirements:
Databases: Need consistency, may require database-aware replication.
File Servers: Can often tolerate looser consistency.
Email: User data needs protection, but some temporary unavailability may be acceptable.
Web Servers: Stateless tiers may not need storage replication (can redeploy).
Tailor DR strategy to application requirements.
Testing DR
Untested DR doesn’t work:
Test Types
Tabletop Exercise: Walk through procedures without actual failover.
Partial Test: Test specific components or applications.
Full Test: Complete failover to DR site.
Surprise Test: Unannounced test to verify procedures and team readiness.
Test Frequency
Critical Systems: Quarterly testing.
Important Systems: Semi-annual testing.
Less Critical: Annual testing.
Regular testing keeps procedures current and teams trained.
Test Documentation
Test Plan: Document what will be tested and success criteria.
Test Results: Document what worked and what didn’t.
Action Items: Fix identified issues before next test.
Lessons Learned: Share learnings across organization.
Orchestration and Automation
DR orchestration tools provide:
Runbooks: Automated execution of failover procedures.
Dependency Management: Bring up systems in correct order.
Verification: Automated verification that systems are running correctly.
Rollback: Ability to roll back if failover fails.
Reporting: Document what happened during failover.
Automation reduces human error during crisis and enables faster RTO.
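As an illustration of the runbook idea, here is a minimal Python sketch: ordered steps, a verification result after each, and an immediate halt on failure. The step names and stubs are placeholders for whatever your orchestration tool or site-specific scripts actually do:

```python
# Minimal failover runbook sketch: ordered steps, verify each, stop on failure.

def promote_dr_storage():   return True   # e.g., break replication, make DR LUNs writable
def start_database():       return True   # bring up the database tier against DR storage
def start_app_servers():    return True   # application tier depends on the database
def update_dns():           return True   # repoint clients at the DR site

RUNBOOK = [
    ("Promote DR storage", promote_dr_storage),
    ("Start database tier", start_database),
    ("Start application tier", start_app_servers),
    ("Redirect clients (DNS)", update_dns),
]

def execute_failover():
    for name, step in RUNBOOK:
        print(f"Running: {name}")
        if not step():
            # Stop immediately: continuing past a failed dependency creates new problems.
            print(f"FAILED: {name} -- halting failover, escalate per runbook")
            return False
    print("Failover complete -- begin application validation")
    return True

execute_failover()
```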
Data Validation
Verify DR data is usable:
Integrity Checks: Verify replicated data matches source.
Application Validation: Can applications actually use DR data?
Performance Testing: Is DR site performance adequate?
Restore Testing: Can you restore from DR backups?
Finding out during disaster that DR data is corrupted is catastrophic.
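For file-level data, even a simple checksum comparison between source and DR copies catches gross corruption. A small sketch follows; the paths are placeholders, and in practice this runs against mounted DR copies during a test, not against live production volumes:

```python
# Simple integrity-check sketch: compare checksums of corresponding files
# (or exported images) at the source and DR site.

import hashlib

def checksum(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(pairs):
    """pairs: iterable of (source_path, dr_path) to compare."""
    for src, dr in pairs:
        match = checksum(src) == checksum(dr)
        print(f"{src} vs {dr}: {'OK' if match else 'MISMATCH -- investigate before trusting DR copy'}")

# verify([("/prod/export/orders.dmp", "/dr/export/orders.dmp")])  # placeholder paths
```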
Multi-Tier Applications
Modern applications span multiple tiers:
Web/App Tier: Often stateless, can be redeployed.
Database Tier: Stateful, requires careful DR.
File Services: User data needs protection.
Dependencies: External services that application depends on.
Configuration: Ensure configurations at DR site are correct.
Ensure all tiers are included in DR plan and brought up in correct order.
Cloud as DR Target
Cloud provides interesting DR options:
Cost-Effective: Pay only for storage, not idle compute.
Scalable: Scale up compute during disaster, scale down afterward.
Geographic Diversity: Use distant cloud regions for disaster protection.
Challenges: Network bandwidth, data transfer costs, performance.
Cloud DR is increasingly common, especially for smaller organizations.
RPO/RTO Cost Curve
There’s a cost/benefit relationship:
Lower RPO/RTO = Higher Cost: Zero RPO requires synchronous replication, active-active sites, automation.
Higher RPO/RTO = Lower Cost: Daily backups shipped off-site are cheap but provide high RPO/RTO.
Diminishing Returns: Going from 24 hours to 1 hour RPO is cheaper than going from 1 hour to zero.
Match investment to business requirements. Not everything needs zero RPO/RTO.
Monitoring and Alerting
Proactive monitoring of DR:
Replication Health: Is replication current? Any errors?
Replication Lag: How far behind is the DR site?
Bandwidth Utilization: Is network adequate for replication load?
DR Site Health: Are DR systems healthy and ready?
Capacity: Adequate capacity at DR site?
Certificate Expiration: Verify that management and replication certificates have not expired (and are not about to).
You don’t want to discover during disaster that replication has been broken for weeks.
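The replication-lag check is the one most worth automating. Below is a small sketch of the threshold logic; how you actually obtain the lag value is array-specific (CLI, SNMP, API) and not shown here, and the thresholds are illustrative:

```python
# Alert when replication lag approaches the RPO, so broken replication is
# caught in hours, not discovered weeks later during a disaster.

RPO_MINUTES = 60          # business requirement for this consistency group
WARN_FRACTION = 0.5       # warn when lag exceeds half the RPO

def check_replication_lag(lag_minutes: float) -> str:
    if lag_minutes > RPO_MINUTES:
        return f"CRITICAL: replication lag {lag_minutes:.0f} min exceeds RPO of {RPO_MINUTES} min"
    if lag_minutes > RPO_MINUTES * WARN_FRACTION:
        return f"WARNING: replication lag {lag_minutes:.0f} min is over {int(WARN_FRACTION * 100)}% of RPO"
    return f"OK: replication lag {lag_minutes:.0f} min"

# Example values; in production these would come from the array or replication tool.
for lag in (10, 40, 90):
    print(check_replication_lag(lag))
```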
Documentation
Critical documentation for DR:
DR Plan: Overall strategy and architecture.
Runbooks: Step-by-step failover procedures.
Network Diagrams: Topology of both sites and interconnections.
Configuration Details: Exact configurations of all components.
Contact Lists: Who to call for different systems/vendors.
Escalation Procedures: When to escalate, to whom.
Keep documentation current. Outdated documentation during crisis is worse than none.
Common DR Mistakes
Mistakes I see frequently:
No Testing: DR plan that’s never tested doesn’t work.
Incomplete Scope: Forgetting components like DNS, Active Directory, network.
Insufficient Bandwidth: Replication can’t keep up with change rate.
No Consistency Groups: Multi-LUN applications with inconsistent replicas.
Assuming Success: Not verifying that applications work at DR site.
Outdated Procedures: Procedures that don’t reflect current environment.
Single Point of Failure: The DR site depends on a resource that exists only at the primary site (for example, DNS or Active Directory).
These mistakes are avoidable with proper planning and testing.
Lessons from Real Disasters
What we learn from actual disasters:
Documentation Critical: During crisis, you need procedures to follow.
Practice Matters: Teams that have practiced handle crises better.
Automation Helps: Automated procedures execute faster and more reliably than manual.
Communication Essential: Keep stakeholders informed during crisis.
Expect Surprises: Something will always be unexpected. Build in flexibility.
Don’t Rush: Mistakes during failover create new problems.
Learn from others’ disasters. You don’t want to learn these lessons the hard way.
Conclusion
Effective disaster recovery requires careful planning, appropriate technology, regular testing, and organizational commitment. It’s not just technology—it’s process, people, and testing.
Key principles:
- Define clear RPO/RTO requirements
- Choose replication technology matching requirements
- Test regularly
- Automate where possible
- Document everything
- Monitor replication health
- Plan for failback, not just failover
Working on FC-Redirect has shown me how storage virtualization can simplify DR by enabling replication between heterogeneous arrays and providing centralized orchestration.
DR is insurance. It seems expensive until you need it. Organizations that invest in proper DR survive disasters. Those that don’t may not.
Plan your DR, test it, maintain it, and hope you never need it. But when you do, you’ll be grateful you invested in it.