Our customers demand extreme availability. When FC-Redirect goes down, their entire storage infrastructure is affected. This year, we achieved 99.999% uptime in production deployments, even as we scaled to 12K flows. Here’s how we did it.
Understanding Five-Nines
99.999% uptime means less than 5.26 minutes of downtime per year (a year has about 525,960 minutes, and 0.001% of that is roughly 5.26 minutes). That's an incredibly tight budget:
- Per month: 26 seconds
- Per week: 6 seconds
- Per day: 0.86 seconds
A single restart takes longer than our weekly downtime budget. Traditional approaches like “patch and reboot” don’t work. We needed to design for continuous operation.
Redundancy at Every Layer
The foundation of high availability is eliminating single points of failure:
Control Plane Redundancy
FC-Redirect runs on a cluster of nodes. Every piece of state is replicated across at least three nodes:
typedef struct replicated_state {
    // Primary copy
    flow_table_t *primary;
    // Replicas (at least 2)
    replica_endpoint_t replicas[MAX_REPLICAS];
    uint32_t replica_count;
    // Quorum configuration
    uint32_t quorum_size;          // Typically (N/2) + 1
} replicated_state_t;

bool write_with_quorum(replicated_state_t *state,
                       flow_update_t *update) {
    // Write to primary
    apply_update(state->primary, update);
    uint32_t acks = 1;             // Primary counts as 1

    // Replicate to peers, stopping once quorum is reached
    for (uint32_t i = 0;
         i < state->replica_count && acks < state->quorum_size;
         i++) {
        if (replicate_sync(&state->replicas[i], update)) {
            acks++;
        }
    }

    // The write is durable only if a quorum of nodes acknowledged it
    return acks >= state->quorum_size;
}
Quorum-based replication lets us tolerate the failure of a minority of nodes while preserving both consistency and availability.
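To make the quorum arithmetic concrete, here is a minimal usage sketch for a three-copy configuration (primary plus two replicas). The helpers init_replicated_state, make_flow_update, and log_error are hypothetical names used only for illustration, not part of the real FC-Redirect API.

// Illustrative only: init_replicated_state, make_flow_update, and
// log_error are hypothetical helpers for this sketch.
void example_quorum_write(void) {
    replicated_state_t state;
    init_replicated_state(&state, 2);     // primary + 2 replicas = 3 copies (N = 3)
    state.quorum_size = (3 / 2) + 1;      // (N/2) + 1 = 2 acknowledgments required

    flow_update_t update = make_flow_update();
    if (!write_with_quorum(&state, &update)) {
        // Quorum not reached: with N = 3, at least one replica must ack
        // in addition to the primary, so surface the failure to the caller.
        log_error("quorum write failed");
    }
}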
Data Plane Redundancy
For the data plane (actual packet processing), we use active-active redundancy. Multiple nodes handle traffic simultaneously, and if one fails, others pick up its load:
typedef struct flow_ownership {
    node_id_t primary_owner;
    node_id_t secondary_owner;
    uint64_t  ownership_version;
} flow_ownership_t;

void handle_node_failure(node_id_t failed_node) {
    // Find all flows owned by the failed node
    for (int i = 0; i < flow_table_size; i++) {
        flow_entry_t *flow = &flow_table[i];
        if (flow->ownership.primary_owner == failed_node) {
            // Promote secondary to primary
            flow->ownership.primary_owner = flow->ownership.secondary_owner;
            // Assign a new secondary
            flow->ownership.secondary_owner = select_new_secondary_owner(flow);
            flow->ownership.ownership_version++;
            // Replicate the ownership change
            broadcast_ownership_change(flow);
        }
    }
}
This failover happens in milliseconds, well within our availability budget.
Network Path Redundancy
We leverage the fabric’s native multipathing capabilities. Every flow has multiple valid paths through the fabric:
typedef struct flow_paths {
    fc_path_t primary_path;
    fc_path_t secondary_path;
    fc_path_t tertiary_path;
    uint32_t  active_path_index;
    path_health_t health[3];
} flow_paths_t;

fc_path_t* select_active_path(flow_paths_t *paths) {
    // Try paths in priority order, taking the first healthy one
    for (int i = 0; i < 3; i++) {
        if (paths->health[i] == PATH_HEALTHY) {
            paths->active_path_index = i;
            switch (i) {
                case 0: return &paths->primary_path;
                case 1: return &paths->secondary_path;
                case 2: return &paths->tertiary_path;
            }
        }
    }
    // All paths unhealthy; fall back to the primary path
    paths->active_path_index = 0;
    return &paths->primary_path;
}
Path failover is handled in hardware by the FC fabric, giving us sub-second switchover.
Graceful Degradation
High availability isn’t just about redundancy; it’s about graceful degradation when things go wrong.
Priority-Based Shedding
When a node is overloaded, we don’t fail completely. Instead, we shed low-priority work:
typedef enum {
    PRIORITY_CRITICAL = 0,   // Flow control, health checks
    PRIORITY_HIGH     = 1,   // Active data flows
    PRIORITY_NORMAL   = 2,   // Statistics updates
    PRIORITY_LOW      = 3    // Logging, monitoring
} work_priority_t;

bool should_accept_work(work_priority_t priority) {
    uint32_t load = get_current_load_percent();
    if (load < 70) {
        return true;                           // Accept all work
    } else if (load < 85) {
        return priority <= PRIORITY_NORMAL;    // Shed low priority
    } else if (load < 95) {
        return priority <= PRIORITY_HIGH;      // Shed normal and low
    } else {
        return priority == PRIORITY_CRITICAL;  // Only critical work
    }
}
This ensures the system remains available for critical operations even when overloaded.
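As an illustration, a dispatch path might consult should_accept_work before queuing each item. The names work_item_t, enqueue_work, and drop_and_count below are hypothetical, used only to sketch how the check fits into the hot path.

// Sketch of a dispatch path that applies priority-based shedding.
// work_item_t, enqueue_work, and drop_and_count are hypothetical names.
void dispatch_work(work_item_t *item) {
    if (should_accept_work(item->priority)) {
        enqueue_work(item);
    } else {
        // Shed the work and count it so monitoring can see shedding activity
        drop_and_count(item->priority);
    }
}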
Circuit Breakers
We implement circuit breakers for external dependencies:
typedef struct circuit_breaker {
    _Atomic uint32_t failure_count;
    _Atomic uint64_t last_failure_time;
    _Atomic uint32_t state;            // CIRCUIT_CLOSED, CIRCUIT_OPEN, CIRCUIT_HALF_OPEN
    uint32_t failure_threshold;
    uint64_t timeout_ms;
} circuit_breaker_t;

bool execute_with_circuit_breaker(circuit_breaker_t *cb,
                                  operation_fn op, void *arg) {
    uint32_t state = atomic_load(&cb->state);
    if (state == CIRCUIT_OPEN) {
        // Check whether the open-state timeout has passed
        uint64_t now = get_time_ms();
        uint64_t last_failure = atomic_load(&cb->last_failure_time);
        if ((now - last_failure) > cb->timeout_ms) {
            // Allow a single probe through in half-open state
            atomic_store(&cb->state, CIRCUIT_HALF_OPEN);
        } else {
            // Circuit still open, fail fast
            return false;
        }
    }

    // Try the operation
    if (op(arg)) {
        // Success: reset the circuit breaker
        atomic_store(&cb->failure_count, 0);
        atomic_store(&cb->state, CIRCUIT_CLOSED);
        return true;
    } else {
        // Failure: record it and open the circuit if the threshold is reached
        uint32_t failures = atomic_fetch_add(&cb->failure_count, 1) + 1;
        atomic_store(&cb->last_failure_time, get_time_ms());
        if (failures >= cb->failure_threshold) {
            atomic_store(&cb->state, CIRCUIT_OPEN);
        }
        return false;
    }
}
This prevents cascading failures when a dependency (like storage) becomes unavailable.
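For example, a storage lookup can be wrapped in the breaker so callers fail fast (or fall back to a cache) while the dependency is unhealthy. In this sketch, query_storage_backend, serve_from_local_cache, and the threshold/timeout values are assumptions, not the production configuration; it also assumes CIRCUIT_CLOSED is the zero value.

// Hypothetical usage: wrapping a storage lookup in a circuit breaker.
// query_storage_backend and serve_from_local_cache are placeholders, and
// the threshold/timeout values are illustrative. Assumes CIRCUIT_CLOSED == 0.
static circuit_breaker_t storage_cb = {
    .failure_threshold = 5,
    .timeout_ms        = 30000
};

bool lookup_flow_metadata(void *request) {
    if (!execute_with_circuit_breaker(&storage_cb, query_storage_backend, request)) {
        // Fail fast instead of letting callers pile up behind a dead dependency
        return serve_from_local_cache(request);
    }
    return true;
}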
Health Monitoring and Failure Detection
Fast failure detection is critical. We can’t fix problems we don’t know about.
Heartbeat Protocol
Nodes exchange heartbeats every 100ms:
typedef struct heartbeat {
    node_id_t     source;
    uint64_t      sequence_number;
    timestamp_t   send_time;
    node_health_t health;
    uint32_t      load_percent;
} heartbeat_t;

void heartbeat_monitor_thread() {
    while (running) {
        timestamp_t now = get_time_ms();
        for (int i = 0; i < num_peers; i++) {
            peer_node_t *peer = &peers[i];
            uint64_t last_heartbeat = peer->last_heartbeat_time;
            if ((now - last_heartbeat) > HEARTBEAT_TIMEOUT) {
                // Peer has missed enough heartbeats to be declared down
                handle_peer_failure(peer);
            }
        }
        sleep_ms(50);  // Check twice per heartbeat interval
    }
}
With 100ms heartbeats and 300ms timeout, we detect failures in under half a second.
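The sending side is symmetric. Below is a rough sketch of it; local_node_id, get_self_health, and send_to_all_peers are stand-ins for the real identity, health, and transport plumbing.

// Sketch of the sending side of the heartbeat protocol.
// local_node_id, get_self_health, and send_to_all_peers are hypothetical.
void heartbeat_sender_thread() {
    uint64_t sequence = 0;
    while (running) {
        heartbeat_t hb = {
            .source          = local_node_id,
            .sequence_number = sequence++,
            .send_time       = get_time_ms(),
            .health          = get_self_health(),
            .load_percent    = get_current_load_percent()
        };
        send_to_all_peers(&hb, sizeof(hb));
        sleep_ms(100);  // 100 ms heartbeat interval
    }
}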
Self-Health Checks
Each node monitors its own health:
typedef struct health_check_result {
    bool cpu_ok;
    bool memory_ok;
    bool network_ok;
    bool storage_ok;
    uint32_t overall_health_score;   // 0-100
} health_check_result_t;

health_check_result_t perform_health_check() {
    health_check_result_t result = {0};

    // CPU check
    uint32_t cpu_load = get_cpu_load_percent();
    result.cpu_ok = (cpu_load < 90);

    // Memory check
    uint32_t mem_used = get_memory_used_percent();
    result.memory_ok = (mem_used < 85);

    // Network check
    result.network_ok = check_network_connectivity();

    // Storage check
    result.storage_ok = check_storage_responsiveness();

    // Overall score: each passing check contributes 25 points
    result.overall_health_score =
        (result.cpu_ok     ? 25 : 0) +
        (result.memory_ok  ? 25 : 0) +
        (result.network_ok ? 25 : 0) +
        (result.storage_ok ? 25 : 0);

    return result;
}
Nodes advertise their health in heartbeats. If health degrades, we proactively shed load before failure.
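One way to wire the self-check into proactive shedding is sketched below. The 70-point threshold, enter_degraded_mode, and exit_degraded_mode are assumptions for illustration, not the production policy.

// Illustrative: tie the self-health score to proactive load shedding.
// The 70-point threshold and the enter_degraded_mode/exit_degraded_mode
// helpers are assumptions for this sketch.
void health_check_thread() {
    while (running) {
        health_check_result_t result = perform_health_check();
        if (result.overall_health_score < 70) {
            // Advertise degraded health and start refusing low-priority work
            enter_degraded_mode();
        } else {
            exit_degraded_mode();
        }
        sleep_ms(1000);  // Re-evaluate once per second
    }
}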
Rolling Upgrades
To maintain availability during upgrades, we use rolling updates:
typedef enum {
    UPGRADE_STATE_NORMAL,
    UPGRADE_STATE_DRAINING,
    UPGRADE_STATE_UPGRADING,
    UPGRADE_STATE_VALIDATING
} upgrade_state_t;

void perform_rolling_upgrade(node_id_t node) {
    // 1. Drain traffic
    set_node_state(node, UPGRADE_STATE_DRAINING);
    drain_node_traffic(node, 30*SECONDS);

    // 2. Verify drained
    wait_for_zero_active_flows(node, 60*SECONDS);

    // 3. Upgrade
    set_node_state(node, UPGRADE_STATE_UPGRADING);
    apply_software_upgrade(node);

    // 4. Validate
    set_node_state(node, UPGRADE_STATE_VALIDATING);
    if (!validate_node_health(node)) {
        rollback_upgrade(node);
        return;
    }

    // 5. Return to service
    set_node_state(node, UPGRADE_STATE_NORMAL);
    enable_node_traffic(node);
}

void upgrade_cluster() {
    for (int i = 0; i < num_nodes; i++) {
        // Upgrade one node at a time
        perform_rolling_upgrade(nodes[i].id);
        // Wait for stability before moving to the next node
        sleep_seconds(60);
    }
}
This allows us to upgrade the entire cluster without any downtime.
Chaos Engineering
We actively test our high availability mechanisms through controlled chaos:
void chaos_test_node_failure() {
    // Randomly kill a node
    node_id_t victim = select_random_node();
    kill_node(victim);

    // Verify the cluster remains healthy
    assert(check_cluster_health());
    assert(check_no_data_loss());
    assert(check_performance_within_bounds());

    // Restart the node
    restart_node(victim);

    // Verify recovery
    assert(wait_for_node_healthy(victim, 60*SECONDS));
    assert(check_cluster_health());
}

void chaos_test_network_partition() {
    // Create a network partition
    node_id_t nodes_a[] = {0, 1, 2};
    node_id_t nodes_b[] = {3, 4, 5};
    create_network_partition(nodes_a, 3, nodes_b, 3);

    // Verify both sides remain available
    assert(check_partition_availability(nodes_a, 3));
    assert(check_partition_availability(nodes_b, 3));

    // Heal the partition
    heal_network_partition();

    // Verify convergence
    assert(wait_for_cluster_convergence(120*SECONDS));
}
We run these chaos tests weekly in our staging environment. They’ve caught numerous subtle bugs before they reached production.
Operational Practices
Technology alone doesn’t deliver high availability. Operational practices matter:
1. Blameless Postmortems
Every incident gets a thorough postmortem focused on learning, not blaming. We document:
- What happened
- Why it happened
- How we detected it
- How we fixed it
- How we’ll prevent it
2. Runbook Automation
Common operational tasks are automated:
- Failover procedures
- Health checks
- Recovery procedures
- Upgrade workflows
This eliminates human error and speeds response time.
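As a rough illustration, an automated failover runbook can simply compose the primitives shown earlier; page_oncall and verify_quorum_intact are hypothetical helpers added only for this sketch.

// Sketch of an automated failover runbook built from earlier primitives.
// page_oncall and verify_quorum_intact are hypothetical helpers.
void run_failover_runbook(node_id_t failed_node) {
    handle_node_failure(failed_node);              // Reassign flow ownership
    if (!check_cluster_health() || !verify_quorum_intact()) {
        // Stop at the first surprise and escalate to a human
        page_oncall("failover runbook: cluster unhealthy after ownership move");
        return;
    }
    restart_node(failed_node);                     // Attempt automated recovery
    if (!wait_for_node_healthy(failed_node, 60*SECONDS)) {
        page_oncall("failover runbook: node did not recover");
    }
}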
3. Monitoring and Alerting
We monitor everything:
- Node health
- Flow counts and rates
- Error rates
- Latency percentiles
- Resource utilization
Alerts are tuned to be actionable. No alert fatigue.
4. Testing in Production
We use canary deployments and feature flags to test changes with a small percentage of traffic before rolling out broadly.
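A minimal sketch of how a feature flag plus a traffic percentage might gate a new code path is shown below; feature_enabled, canary_percent, and flow_hash are illustrative names, not the real feature-flag API.

// Sketch of canary gating: route a small, stable percentage of flows
// through a new code path. feature_enabled, canary_percent, and
// flow_hash are illustrative names.
bool use_new_code_path(flow_entry_t *flow) {
    if (!feature_enabled("new_flow_engine")) {
        return false;
    }
    // Hash the flow so a given flow is consistently canaried or not
    uint32_t bucket = flow_hash(flow) % 100;
    return bucket < canary_percent("new_flow_engine");
}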
Results
These techniques have delivered exceptional results:
- Uptime: 99.999% over the last 12 months
- Mean Time to Detect: 45 seconds
- Mean Time to Recovery: 3 minutes
- Zero data loss events
- Zero unplanned outages
Most importantly, our customers trust the system. For storage infrastructure, trust is everything.
Lessons Learned
Building for high availability requires:
- Redundancy at every layer: No single points of failure
- Fast failure detection: Can’t fix what you don’t know is broken
- Graceful degradation: Better to be slow than down
- Automated recovery: Humans are too slow
- Testing chaos: If you don’t test failure modes, they will surprise you
- Cultural commitment: Everyone must prioritize reliability
High availability isn’t a feature; it’s a culture and a set of practices. It requires constant vigilance and continuous improvement.
As we continue scaling FC-Redirect throughout 2013, maintaining five-nines availability gets harder, not easier. But with the right architecture and practices, it’s absolutely achievable.
Your customers depend on your availability. Don’t let them down.