Our customers demand extreme availability. When FC-Redirect goes down, their entire storage infrastructure is affected. This year, we achieved 99.999% uptime in production deployments, even as we scaled to 12K flows. Here’s how we did it.
Understanding Five-Nines
99.999% uptime means less than 5.26 minutes of downtime per year (a year has about 525,960 minutes, and 0.001% of that is roughly 5.26 minutes). That's an incredibly tight budget:
- Per month: 26 seconds
- Per week: 6 seconds
- Per day: 0.86 seconds
A single restart takes longer than our weekly downtime budget. Traditional approaches like “patch and reboot” don’t work. We needed to design for continuous operation.
Redundancy at Every Layer
The foundation of high availability is eliminating single points of failure:
Control Plane Redundancy
FC-Redirect runs on a cluster of nodes. Every piece of state is replicated across at least three nodes:
typedef struct replicated_state {
    // Primary copy
    flow_table_t *primary;
    // Replicas (at least 2)
    replica_endpoint_t replicas[MAX_REPLICAS];
    uint32_t replica_count;
    // Quorum configuration
    uint32_t quorum_size;          // Typically (N/2) + 1
} replicated_state_t;

bool write_with_quorum(replicated_state_t *state,
                       flow_update_t *update) {
    // Write to primary
    apply_update(state->primary, update);
    uint32_t acks = 1;             // Primary counts as 1

    // Replicate to peers, stopping once quorum is reached
    for (uint32_t i = 0;
         i < state->replica_count && acks < state->quorum_size;
         i++) {
        if (replicate_sync(&state->replicas[i], update)) {
            acks++;
        }
    }

    // The write is durable only if a quorum of nodes acknowledged it
    return acks >= state->quorum_size;
}
Quorum-based replication lets us tolerate the failure of a minority of nodes while preserving both consistency and availability.
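To make the quorum arithmetic concrete, here is a minimal usage sketch for a three-copy configuration (primary plus two replicas). The helpers init_replicated_state, make_flow_update, and log_error are hypothetical names used only for illustration, not part of the real FC-Redirect API.

// Illustrative only: init_replicated_state, make_flow_update, and
// log_error are hypothetical helpers for this sketch.
void example_quorum_write(void) {
    replicated_state_t state;
    init_replicated_state(&state, 2);     // primary + 2 replicas = 3 copies (N = 3)
    state.quorum_size = (3 / 2) + 1;      // (N/2) + 1 = 2 acknowledgments required

    flow_update_t update = make_flow_update();
    if (!write_with_quorum(&state, &update)) {
        // Quorum not reached: with N = 3, at least one replica must ack
        // in addition to the primary, so surface the failure to the caller.
        log_error("quorum write failed");
    }
}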
Data Plane Redundancy
For the data plane (actual packet processing), we use active-active redundancy. Multiple nodes handle traffic simultaneously, and if one fails, others pick up its load:
typedef struct flow_ownership {
    node_id_t primary_owner;
    node_id_t secondary_owner;
    uint64_t  ownership_version;
} flow_ownership_t;

void handle_node_failure(node_id_t failed_node) {
    // Find all flows owned by the failed node
    for (int i = 0; i < flow_table_size; i++) {
        flow_entry_t *flow = &flow_table[i];
        if (flow->ownership.primary_owner == failed_node) {
            // Promote secondary to primary
            flow->ownership.primary_owner = flow->ownership.secondary_owner;
            // Assign a new secondary
            flow->ownership.secondary_owner = select_new_secondary_owner(flow);
            flow->ownership.ownership_version++;
            // Replicate the ownership change
            broadcast_ownership_change(flow);
        }
    }
}
This failover happens in milliseconds, well within our availability budget.
Network Path Redundancy
We leverage the fabric’s native multipathing capabilities. Every flow has multiple valid paths through the fabric:
typedef struct flow_paths {
    fc_path_t primary_path;
    fc_path_t secondary_path;
    fc_path_t tertiary_path;
    uint32_t  active_path_index;
    path_health_t health[3];
} flow_paths_t;

fc_path_t* select_active_path(flow_paths_t *paths) {
    // Try paths in priority order, taking the first healthy one
    for (int i = 0; i < 3; i++) {
        if (paths->health[i] == PATH_HEALTHY) {
            paths->active_path_index = i;
            switch (i) {
                case 0: return &paths->primary_path;
                case 1: return &paths->secondary_path;
                case 2: return &paths->tertiary_path;
            }
        }
    }
    // All paths unhealthy; fall back to the primary path
    paths->active_path_index = 0;
    return &paths->primary_path;
}
Path failover is handled in hardware by the FC fabric, giving us sub-second switchover.
Graceful Degradation
High availability isn’t just about redundancy; it’s about graceful degradation when things go wrong.
Priority-Based Shedding
When a node is overloaded, we don’t fail completely. Instead, we shed low-priority work:
typedef enum {
    PRIORITY_CRITICAL = 0,   // Flow control, health checks
    PRIORITY_HIGH     = 1,   // Active data flows
    PRIORITY_NORMAL   = 2,   // Statistics updates
    PRIORITY_LOW      = 3    // Logging, monitoring
} work_priority_t;

bool should_accept_work(work_priority_t priority) {
    uint32_t load = get_current_load_percent();
    if (load < 70) {
        return true;                           // Accept all work
    } else if (load < 85) {
        return priority <= PRIORITY_NORMAL;    // Shed low priority
    } else if (load < 95) {
        return priority <= PRIORITY_HIGH;      // Shed normal and low
    } else {
        return priority == PRIORITY_CRITICAL;  // Only critical work
    }
}
This ensures the system remains available for critical operations even when overloaded.
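As an illustration, a dispatch path might consult should_accept_work before queuing each item. The names work_item_t, enqueue_work, and drop_and_count below are hypothetical, used only to sketch how the check fits into the hot path.

// Sketch of a dispatch path that applies priority-based shedding.
// work_item_t, enqueue_work, and drop_and_count are hypothetical names.
void dispatch_work(work_item_t *item) {
    if (should_accept_work(item->priority)) {
        enqueue_work(item);
    } else {
        // Shed the work and count it so monitoring can see shedding activity
        drop_and_count(item->priority);
    }
}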
Circuit Breakers
We implement circuit breakers for external dependencies:
typedef struct circuit_breaker {
    _Atomic uint32_t failure_count;
    _Atomic uint64_t last_failure_time;
    _Atomic uint32_t state;            // CIRCUIT_CLOSED, CIRCUIT_OPEN, CIRCUIT_HALF_OPEN
    uint32_t failure_threshold;
    uint64_t timeout_ms;
} circuit_breaker_t;

bool execute_with_circuit_breaker(circuit_breaker_t *cb,
                                  operation_fn op, void *arg) {
    uint32_t state = atomic_load(&cb->state);
    if (state == CIRCUIT_OPEN) {
        // Check whether the open-state timeout has passed
        uint64_t now = get_time_ms();
        uint64_t last_failure = atomic_load(&cb->last_failure_time);
        if ((now - last_failure) > cb->timeout_ms) {
            // Allow a single probe through in half-open state
            atomic_store(&cb->state, CIRCUIT_HALF_OPEN);
        } else {
            // Circuit still open, fail fast
            return false;
        }
    }

    // Try the operation
    if (op(arg)) {
        // Success: reset the circuit breaker
        atomic_store(&cb->failure_count, 0);
        atomic_store(&cb->state, CIRCUIT_CLOSED);
        return true;
    } else {
        // Failure: record it and open the circuit if the threshold is reached
        uint32_t failures = atomic_fetch_add(&cb->failure_count, 1) + 1;
        atomic_store(&cb->last_failure_time, get_time_ms());
        if (failures >= cb->failure_threshold) {
            atomic_store(&cb->state, CIRCUIT_OPEN);
        }
        return false;
    }
}
This prevents cascading failures when a dependency (like storage) becomes unavailable.
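For example, a storage lookup can be wrapped in the breaker so callers fail fast (or fall back to a cache) while the dependency is unhealthy. In this sketch, query_storage_backend, serve_from_local_cache, and the threshold/timeout values are assumptions, not the production configuration; it also assumes CIRCUIT_CLOSED is the zero value.

// Hypothetical usage: wrapping a storage lookup in a circuit breaker.
// query_storage_backend and serve_from_local_cache are placeholders, and
// the threshold/timeout values are illustrative. Assumes CIRCUIT_CLOSED == 0.
static circuit_breaker_t storage_cb = {
    .failure_threshold = 5,
    .timeout_ms        = 30000
};

bool lookup_flow_metadata(void *request) {
    if (!execute_with_circuit_breaker(&storage_cb, query_storage_backend, request)) {
        // Fail fast instead of letting callers pile up behind a dead dependency
        return serve_from_local_cache(request);
    }
    return true;
}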
Health Monitoring and Failure Detection
Fast failure detection is critical. We can’t fix problems we don’t know about.
Heartbeat Protocol
Nodes exchange heartbeats every 100ms:
typedef struct heartbeat {
    node_id_t     source;
    uint64_t      sequence_number;
    timestamp_t   send_time;
    node_health_t health;
    uint32_t      load_percent;
} heartbeat_t;

void heartbeat_monitor_thread() {
    while (running) {
        timestamp_t now = get_time_ms();
        for (int i = 0; i < num_peers; i++) {
            peer_node_t *peer = &peers[i];
            uint64_t last_heartbeat = peer->last_heartbeat_time;
            if ((now - last_heartbeat) > HEARTBEAT_TIMEOUT) {
                // Peer has missed enough heartbeats to be declared down
                handle_peer_failure(peer);
            }
        }
        sleep_ms(50);  // Check twice per heartbeat interval
    }
}
With 100ms heartbeats and 300ms timeout, we detect failures in under half a second.
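The sending side is symmetric. Below is a rough sketch of it; local_node_id, get_self_health, and send_to_all_peers are stand-ins for the real identity, health, and transport plumbing.

// Sketch of the sending side of the heartbeat protocol.
// local_node_id, get_self_health, and send_to_all_peers are hypothetical.
void heartbeat_sender_thread() {
    uint64_t sequence = 0;
    while (running) {
        heartbeat_t hb = {
            .source          = local_node_id,
            .sequence_number = sequence++,
            .send_time       = get_time_ms(),
            .health          = get_self_health(),
            .load_percent    = get_current_load_percent()
        };
        send_to_all_peers(&hb, sizeof(hb));
        sleep_ms(100);  // 100 ms heartbeat interval
    }
}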
Self-Health Checks
Each node monitors its own health:
typedef struct health_check_result {
    bool cpu_ok;
    bool memory_ok;
    bool network_ok;
    bool storage_ok;
    uint32_t overall_health_score;   // 0-100
} health_check_result_t;

health_check_result_t perform_health_check() {
    health_check_result_t result = {0};

    // CPU check
    uint32_t cpu_load = get_cpu_load_percent();
    result.cpu_ok = (cpu_load < 90);

    // Memory check
    uint32_t mem_used = get_memory_used_percent();
    result.memory_ok = (mem_used < 85);

    // Network check
    result.network_ok = check_network_connectivity();

    // Storage check
    result.storage_ok = check_storage_responsiveness();

    // Overall score: each passing check contributes 25 points
    result.overall_health_score =
        (result.cpu_ok     ? 25 : 0) +
        (result.memory_ok  ? 25 : 0) +
        (result.network_ok ? 25 : 0) +
        (result.storage_ok ? 25 : 0);

    return result;
}
Nodes advertise their health in heartbeats. If health degrades, we proactively shed load before failure.
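One way to wire the self-check into proactive shedding is sketched below. The 70-point threshold, enter_degraded_mode, and exit_degraded_mode are assumptions for illustration, not the production policy.

// Illustrative: tie the self-health score to proactive load shedding.
// The 70-point threshold and the enter_degraded_mode/exit_degraded_mode
// helpers are assumptions for this sketch.
void health_check_thread() {
    while (running) {
        health_check_result_t result = perform_health_check();
        if (result.overall_health_score < 70) {
            // Advertise degraded health and start refusing low-priority work
            enter_degraded_mode();
        } else {
            exit_degraded_mode();
        }
        sleep_ms(1000);  // Re-evaluate once per second
    }
}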
Rolling Upgrades
To maintain availability during upgrades, we use rolling updates:
typedef enum {
    UPGRADE_STATE_NORMAL,
    UPGRADE_STATE_DRAINING,
    UPGRADE_STATE_UPGRADING,
    UPGRADE_STATE_VALIDATING
} upgrade_state_t;

void perform_rolling_upgrade(node_id_t node) {
    // 1. Drain traffic
    set_node_state(node, UPGRADE_STATE_DRAINING);
    drain_node_traffic(node, 30*SECONDS);

    // 2. Verify drained
    wait_for_zero_active_flows(node, 60*SECONDS);

    // 3. Upgrade
    set_node_state(node, UPGRADE_STATE_UPGRADING);
    apply_software_upgrade(node);

    // 4. Validate
    set_node_state(node, UPGRADE_STATE_VALIDATING);
    if (!validate_node_health(node)) {
        rollback_upgrade(node);
        return;
    }

    // 5. Return to service
    set_node_state(node, UPGRADE_STATE_NORMAL);
    enable_node_traffic(node);
}

void upgrade_cluster() {
    for (int i = 0; i < num_nodes; i++) {
        // Upgrade one node at a time
        perform_rolling_upgrade(nodes[i].id);
        // Wait for stability before moving to the next node
        sleep_seconds(60);
    }
}
This allows us to upgrade the entire cluster without any downtime.
Chaos Engineering
We actively test our high availability mechanisms through controlled chaos:
void chaos_test_node_failure() {
    // Randomly kill a node
    node_id_t victim = select_random_node();
    kill_node(victim);

    // Verify the cluster remains healthy
    assert(check_cluster_health());
    assert(check_no_data_loss());
    assert(check_performance_within_bounds());

    // Restart the node
    restart_node(victim);

    // Verify recovery
    assert(wait_for_node_healthy(victim, 60*SECONDS));
    assert(check_cluster_health());
}

void chaos_test_network_partition() {
    // Create a network partition
    node_id_t nodes_a[] = {0, 1, 2};
    node_id_t nodes_b[] = {3, 4, 5};
    create_network_partition(nodes_a, 3, nodes_b, 3);

    // Verify both sides remain available
    assert(check_partition_availability(nodes_a, 3));
    assert(check_partition_availability(nodes_b, 3));

    // Heal the partition
    heal_network_partition();

    // Verify convergence
    assert(wait_for_cluster_convergence(120*SECONDS));
}
We run these chaos tests weekly in our staging environment. They’ve caught numerous subtle bugs before they reached production.
Operational Practices
Technology alone doesn’t deliver high availability. Operational practices matter:
1. Blameless Postmortems
Every incident gets a thorough postmortem focused on learning, not blaming. We document:
- What happened
- Why it happened
- How we detected it
- How we fixed it
- How we’ll prevent it
2. Runbook Automation
Common operational tasks are automated:
- Failover procedures
- Health checks
- Recovery procedures
- Upgrade workflows
This eliminates human error and speeds response time.
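As a rough illustration, an automated failover runbook can simply compose the primitives shown earlier; page_oncall and verify_quorum_intact are hypothetical helpers added only for this sketch.

// Sketch of an automated failover runbook built from earlier primitives.
// page_oncall and verify_quorum_intact are hypothetical helpers.
void run_failover_runbook(node_id_t failed_node) {
    handle_node_failure(failed_node);              // Reassign flow ownership
    if (!check_cluster_health() || !verify_quorum_intact()) {
        // Stop at the first surprise and escalate to a human
        page_oncall("failover runbook: cluster unhealthy after ownership move");
        return;
    }
    restart_node(failed_node);                     // Attempt automated recovery
    if (!wait_for_node_healthy(failed_node, 60*SECONDS)) {
        page_oncall("failover runbook: node did not recover");
    }
}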
3. Monitoring and Alerting
We monitor everything:
- Node health
- Flow counts and rates
- Error rates
- Latency percentiles
- Resource utilization
Alerts are tuned to be actionable. No alert fatigue.
4. Testing in Production
We use canary deployments and feature flags to test changes with a small percentage of traffic before rolling out broadly.
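A minimal sketch of how a feature flag plus a traffic percentage might gate a new code path is shown below; feature_enabled, canary_percent, and flow_hash are illustrative names, not the real feature-flag API.

// Sketch of canary gating: route a small, stable percentage of flows
// through a new code path. feature_enabled, canary_percent, and
// flow_hash are illustrative names.
bool use_new_code_path(flow_entry_t *flow) {
    if (!feature_enabled("new_flow_engine")) {
        return false;
    }
    // Hash the flow so a given flow is consistently canaried or not
    uint32_t bucket = flow_hash(flow) % 100;
    return bucket < canary_percent("new_flow_engine");
}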
Results
These techniques have delivered exceptional results:
- Uptime: 99.999% over the last 12 months
- Mean Time to Detect: 45 seconds
- Mean Time to Recovery: 3 minutes
- Zero data loss events
- Zero unplanned outages
Most importantly, our customers trust the system. For storage infrastructure, trust is everything.
Lessons Learned
Building for high availability requires:
- Redundancy at every layer: No single points of failure
- Fast failure detection: Can’t fix what you don’t know is broken
- Graceful degradation: Better to be slow than down
- Automated recovery: Humans are too slow
- Testing chaos: If you don’t test failure modes, they will surprise you
- Cultural commitment: Everyone must prioritize reliability
High availability isn’t a feature; it’s a culture and a set of practices. It requires constant vigilance and continuous improvement.
As we continue scaling FC-Redirect throughout 2013, maintaining five-nines availability gets harder, not easier. But with the right architecture and practices, it’s absolutely achievable.
Your customers depend on your availability. Don’t let them down.