One of the most impactful architectural changes I made to FC-Redirect this year was introducing asynchronous processing to decouple fast-path operations from slow-path work. This pattern is common in high-performance systems, but getting it right requires careful thinking about consistency, ordering, and failure handling.

The Synchronous Processing Problem

Originally, FC-Redirect processed flow updates synchronously. When a packet arrived:

  1. Parse and validate the packet
  2. Look up the flow entry
  3. Update flow statistics
  4. Apply redirect policy
  5. Update peer nodes
  6. Write to persistent storage
  7. Update monitoring counters
  8. Forward the packet

All these operations happened in sequence before processing the next packet. This approach is simple and ensures strong consistency, but it has a fatal flaw: the fast path (steps 1-4 and 8) is blocked by the slow path operations in the middle (steps 5-7).
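
For concreteness, the original handler looked roughly like this; the function names are illustrative stand-ins, not the actual FC-Redirect internals:

// Sketch of the original synchronous handler (function names are illustrative)
void handle_packet_sync(packet_t *pkt) {
    if (!parse_and_validate(pkt))            // Step 1
        return;

    flow_entry_t *flow = flow_lookup(pkt);   // Step 2
    update_flow_stats(flow, pkt);            // Step 3
    apply_redirect_policy(flow, pkt);        // Step 4

    sync_peers(flow);                        // Step 5: ~500us of blocking
    write_flow_to_storage(flow);             // Step 6: ~2ms of blocking
    update_monitoring(flow);                 // Step 7: ~100us of blocking

    forward_packet(pkt);                     // Step 8: delayed by the slow steps above
}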

At high packet rates, this became a bottleneck. Operations like storage writes and peer updates could take milliseconds, during which we couldn’t process new packets. Our throughput was limited by the slowest operation.

Identifying Fast vs. Slow Paths

The first step was categorizing operations by their latency characteristics and requirements:

Fast Path (must be synchronous):

  • Packet parsing: ~200ns
  • Flow lookup: ~100ns
  • Statistics update: ~50ns
  • Redirect decision: ~300ns
  • Packet forwarding: ~400ns
  • Total: ~1μs

Slow Path (can be asynchronous):

  • Peer synchronization: ~500μs
  • Persistent storage: ~2ms
  • Monitoring updates: ~100μs
  • Audit logging: ~800μs
  • Total: ~3.4ms

The slow path operations were taking 3,400x longer than the fast path! Even worse, most packets didn’t require slow path operations. We were paying millisecond costs for microsecond work.

Designing the Async Architecture

I designed a multi-queue architecture that separates concerns:

typedef struct async_work_queue {
    // Lock-free ring buffer for work items
    atomic_uint64_t head;
    atomic_uint64_t tail;
    work_item_t items[QUEUE_SIZE];

    // Processing thread
    pthread_t worker_thread;
    int cpu_affinity;        // Core this queue's worker is pinned to
    atomic_bool shutdown;    // Set to stop the worker loop

    // Statistics
    atomic_uint64_t enqueued;
    atomic_uint64_t processed;
    atomic_uint64_t dropped;
} async_work_queue_t;

// Different queues for different work types
async_work_queue_t peer_sync_queue;
async_work_queue_t storage_queue;
async_work_queue_t monitoring_queue;

Each queue has a dedicated worker thread that processes items asynchronously. The fast path just enqueues work and continues immediately.
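
Setting one of these up is mostly boilerplate: initialize the counters, record the CPU the worker should be pinned to, and spawn the worker. A minimal sketch (worker_thread_func is the loop shown later in this post):

// Sketch: bring up one queue and its dedicated worker thread
void *worker_thread_func(void *arg);  // worker loop, shown later

void init_work_queue(async_work_queue_t *queue, int cpu) {
    atomic_init(&queue->head, 0);
    atomic_init(&queue->tail, 0);
    atomic_init(&queue->enqueued, 0);
    atomic_init(&queue->processed, 0);
    atomic_init(&queue->dropped, 0);
    atomic_init(&queue->shutdown, false);
    queue->cpu_affinity = cpu;

    pthread_create(&queue->worker_thread, NULL, worker_thread_func, queue);
}

void shutdown_work_queue(async_work_queue_t *queue) {
    atomic_store(&queue->shutdown, true);
    pthread_join(queue->worker_thread, NULL);
}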

Work Item Design

Work items need to be self-contained since they’re processed asynchronously:

typedef enum {
    WORK_PEER_SYNC,
    WORK_STORAGE_WRITE,
    WORK_MONITORING_UPDATE,
    WORK_AUDIT_LOG
} work_type_t;

typedef struct work_item {
    work_type_t type;
    wwpn_t flow_key;           // Flow this item belongs to (worker selection + ordering)
    uint64_t sequence_number;  // For ordering
    timestamp_t enqueue_time;  // For latency tracking

    union {
        struct {
            flow_state_t state_snapshot;
        } peer_sync;

        struct {
            uint64_t packets;
            uint64_t bytes;
        } storage_write;

        // Other work types...
    } data;
} work_item_t;

The key is that work items contain snapshots of the data they need. They don’t reference mutable state, avoiding race conditions.
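
For example, when the fast path wants a storage write, it copies the counters it just updated into the item and returns. In this sketch, the flow_entry_t fields and next_sequence() are illustrative, not the real names (a sketch of next_sequence() appears later under Sequence Numbers):

// Sketch: fast path snapshots the flow's counters into a work item and moves on.
// flow_entry_t fields and next_sequence() are illustrative stand-ins.
void queue_storage_update(flow_entry_t *flow) {
    work_item_t item;

    item.type            = WORK_STORAGE_WRITE;
    item.flow_key        = flow->key;
    item.sequence_number = next_sequence(flow);
    item.enqueue_time    = get_time();

    // Copy the values now; the worker never dereferences the live flow entry
    item.data.storage_write.packets = flow->packet_count;
    item.data.storage_write.bytes   = flow->byte_count;

    enqueue_async_work(&item);
}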

Maintaining Correctness

Asynchronous processing introduces complexity around ordering and consistency. I had to ensure several invariants:

Per-Flow Ordering

Updates for a given flow must be processed in order, but different flows can be processed in parallel:

static inline uint32_t flow_to_worker(wwpn_t flow_key) {
    // Hash flow to a specific worker thread
    // Same flow always goes to same worker
    return wwpn_hash(flow_key) % num_worker_threads;
}

void enqueue_async_work(work_item_t *item) {
    uint32_t worker = flow_to_worker(item->flow_key);
    async_work_queue_t *queue = &work_queues[worker];

    enqueue_to_ring_buffer(queue, item);
    wake_worker(worker);
}

By consistently hashing flows to the same worker thread, we maintain per-flow ordering while allowing parallel processing of different flows.
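
wwpn_hash() only needs to spread flow keys evenly across workers. Assuming wwpn_t fits in 64 bits (a WWPN is a 64-bit identifier), a simple integer mixer such as the splitmix64 finalizer does the job:

// Sketch: 64-bit mixer (splitmix64 finalizer) used as the flow hash,
// assuming wwpn_t can be treated as a uint64_t
static inline uint32_t wwpn_hash(wwpn_t flow_key) {
    uint64_t x = (uint64_t)flow_key;
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return (uint32_t)x;
}

One consequence worth calling out: num_worker_threads has to stay fixed while flows are live, otherwise flows remap to different workers and the per-flow ordering guarantee is lost.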

Sequence Numbers

Each work item gets a monotonically increasing sequence number. Workers can detect and handle out-of-order delivery:

void process_work_item(work_item_t *item) {
    // Per-worker flow table, keyed by flow (lookup helper assumed)
    flow_state_t *state = lookup_worker_flow_state(item->flow_key);

    if (item->sequence_number <= state->last_processed_sequence) {
        // Stale update, discard
        return;
    }

    // Process the update
    apply_update(state, item);
    state->last_processed_sequence = item->sequence_number;
}

This handles the rare case where the OS scheduler reorders work items between enqueue and processing.
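
Generating the sequence numbers is cheap enough for the fast path; one way is a per-flow atomic counter on the flow entry (a sketch, assuming the illustrative flow_entry_t carries such a field):

// Sketch: per-flow sequence numbers from an atomic counter, assuming the
// (illustrative) flow_entry_t carries an atomic_uint64_t next_sequence field
static inline uint64_t next_sequence(flow_entry_t *flow) {
    // fetch_add returns the old value; +1 makes sequence numbers start at 1,
    // so last_processed_sequence can safely start at 0
    return atomic_fetch_add(&flow->next_sequence, 1) + 1;
}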

Backpressure Handling

When work queues fill up (slow consumers), we need backpressure:

bool enqueue_with_backpressure(async_work_queue_t *queue,
                               work_item_t *item) {
    uint64_t head = atomic_load(&queue->head);
    uint64_t tail = atomic_load(&queue->tail);

    if (tail - head >= QUEUE_SIZE) {
        // Queue full, apply backpressure
        if (item->type == WORK_PEER_SYNC) {
            // Critical work, wait for space
            while (tail - head >= QUEUE_SIZE) {
                sched_yield();
                head = atomic_load(&queue->head);
                tail = atomic_load(&queue->tail);
            }
        } else {
            // Non-critical work, drop it and count the drop
            atomic_fetch_add(&queue->dropped, 1);
            return false;
        }
    }

    // Enqueue the item (a single producer per queue makes this store safe)
    queue->items[tail % QUEUE_SIZE] = *item;
    atomic_store(&queue->tail, tail + 1);
    atomic_fetch_add(&queue->enqueued, 1);
    return true;
}

Different work types have different drop policies. Critical work (peer sync) blocks until space is available, while non-critical work (monitoring) can be dropped under load.
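
To keep those policies from being scattered across call sites, the per-type decision can live in one small helper; something like this (illustrative, mirroring the branch in enqueue_with_backpressure above):

// Sketch: centralize the per-type drop policy instead of testing
// WORK_PEER_SYNC inline wherever items are enqueued
typedef enum {
    DROP_NEVER,       // critical: block the producer until space frees up
    DROP_UNDER_LOAD   // non-critical: drop and count when the queue is full
} drop_policy_t;

static inline drop_policy_t drop_policy_for(work_type_t type) {
    switch (type) {
    case WORK_PEER_SYNC:
        return DROP_NEVER;
    case WORK_STORAGE_WRITE:
    case WORK_MONITORING_UPDATE:
    case WORK_AUDIT_LOG:
    default:
        return DROP_UNDER_LOAD;
    }
}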

Worker Thread Implementation

Worker threads run a tight loop processing queued work:

void* worker_thread_func(void *arg) {
    async_work_queue_t *queue = (async_work_queue_t*)arg;

    // Pin to specific CPU core for cache locality
    pin_to_cpu(queue->cpu_affinity);

    while (!queue->shutdown) {
        uint64_t head = atomic_load(&queue->head);
        uint64_t tail = atomic_load(&queue->tail);

        if (head == tail) {
            // Queue empty, wait for work
            wait_for_work(queue, 1000); // 1ms timeout
            continue;
        }

        // Process next item
        work_item_t *item = &queue->items[head % QUEUE_SIZE];

        uint64_t start = rdtsc();
        process_work_item(item);
        uint64_t end = rdtsc();

        // Track latency
        update_latency_histogram(item->type, end - start);

        // Advance head pointer
        atomic_store(&queue->head, head + 1);
        atomic_fetch_add(&queue->processed, 1);
    }

    return NULL;
}

Key features:

  • CPU affinity for cache locality
  • Timeout-based waiting to balance latency vs CPU usage (sketched below)
  • Per-operation latency tracking for observability
  • Lock-free queue operations
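
The wait/wake pair behind that second bullet can be as simple as a mutex and condition variable with a deadline. A sketch, assuming the queue struct also carries a pthread_mutex_t lock and pthread_cond_t cond (omitted from the struct shown earlier for brevity):

// Sketch: timeout-based waiting. Assumes async_work_queue_t additionally has
// pthread_mutex_t lock and pthread_cond_t cond members (not shown earlier).
static void wait_for_work(async_work_queue_t *queue, uint32_t timeout_us) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += (long)timeout_us * 1000;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += deadline.tv_nsec / 1000000000L;
        deadline.tv_nsec %= 1000000000L;
    }

    pthread_mutex_lock(&queue->lock);
    // Re-check emptiness under the lock; the producer signals while holding it,
    // so a wake cannot be lost, and the deadline bounds the wait regardless
    if (atomic_load(&queue->head) == atomic_load(&queue->tail))
        pthread_cond_timedwait(&queue->cond, &queue->lock, &deadline);
    pthread_mutex_unlock(&queue->lock);
}

// Producer side: wake_worker() in the enqueue path would call this on the chosen queue
static void signal_work_available(async_work_queue_t *queue) {
    pthread_mutex_lock(&queue->lock);
    pthread_cond_signal(&queue->cond);
    pthread_mutex_unlock(&queue->lock);
}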

Batching Slow Operations

Since slow path operations are now asynchronous, we can batch them for efficiency:

void storage_worker_batched(async_work_queue_t *queue) {
    work_item_t batch[BATCH_SIZE];
    size_t batch_count = 0;
    timestamp_t batch_start = get_time();

    while (!queue->shutdown) {
        work_item_t *item = dequeue_work(queue);

        if (item) {
            if (batch_count == 0)
                batch_start = get_time();  // latency clock starts at the first item
            batch[batch_count++] = *item;
        } else {
            // Queue momentarily empty: wait briefly instead of spinning
            wait_for_work(queue, 1000);
        }

        // Flush batch if full, or if the oldest item has waited too long
        timestamp_t now = get_time();
        if (batch_count >= BATCH_SIZE ||
            (batch_count > 0 && (now - batch_start) > 50*MSEC)) {

            flush_storage_batch(batch, batch_count);
            batch_count = 0;
            batch_start = now;
        }
    }
}

Batching storage writes reduced IOPS by 90% while adding at most 50ms of latency to non-critical operations.
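
The flush is where the win comes from: the whole batch goes to storage in one call instead of one call per item. A sketch, where storage_record_t and storage_write_records() are illustrative stand-ins for the real (unnamed) storage interface:

// Sketch: flush a whole batch with a single storage call.
// storage_record_t and storage_write_records() are illustrative stand-ins.
typedef struct {
    wwpn_t   flow_key;
    uint64_t packets;
    uint64_t bytes;
} storage_record_t;

void flush_storage_batch(const work_item_t *batch, size_t count) {
    storage_record_t records[BATCH_SIZE];
    size_t n = 0;

    for (size_t i = 0; i < count; i++) {
        if (batch[i].type != WORK_STORAGE_WRITE)
            continue;
        records[n].flow_key = batch[i].flow_key;
        records[n].packets  = batch[i].data.storage_write.packets;
        records[n].bytes    = batch[i].data.storage_write.bytes;
        n++;
    }

    if (n > 0)
        storage_write_records(records, n);  // one IOP for the whole batch
}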

Results and Impact

The asynchronous architecture delivered impressive improvements:

Throughput:

  • Packet processing: +40% (2.1M → 2.9M packets/sec)
  • Fast path latency: -30% (2.8μs → 2.0μs)
  • End-to-end latency: Unchanged (slow ops were never in critical path)

Resource Utilization:

  • CPU: -15% (better pipelining, less blocking)
  • Storage IOPS: -90% (batching)
  • Network bandwidth for peer sync: -75% (batching)

Reliability:

  • Packet loss under load: 0.01% → 0% (backpressure prevents drops)
  • Storage write failures: Isolated from fast path

Lessons Learned

This redesign reinforced several key principles:

1. Identify Your Critical Path

Not all operations are equally important. Optimize the critical path aggressively, and handle everything else asynchronously.

2. Consistency Requires Careful Design

Asynchronous processing doesn’t mean “eventually consistent.” With proper ordering guarantees and sequence numbers, you can maintain strong consistency.

3. Backpressure Is Essential

Unbounded queues eventually cause problems. Build backpressure mechanisms from day one.

4. Batching Amplifies Benefits

If you’re already async, batching slow operations multiplies the performance gains.

5. Observability Is Critical

With async processing, a single linear log no longer tells the whole story. Build instrumentation for queue depths, latencies, and drop rates.
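
Concretely, each queue already carries the counters needed; exposing them to monitoring is a snapshot away (a sketch using the struct fields shown earlier):

// Sketch: export queue depth, throughput, and drop counters for monitoring
typedef struct queue_stats {
    uint64_t depth;      // items currently waiting
    uint64_t enqueued;   // total items accepted
    uint64_t processed;  // total items completed by the worker
    uint64_t dropped;    // total items rejected under load
} queue_stats_t;

queue_stats_t snapshot_queue_stats(const async_work_queue_t *queue) {
    queue_stats_t s;
    uint64_t head = atomic_load(&queue->head);
    uint64_t tail = atomic_load(&queue->tail);

    s.depth     = tail - head;
    s.enqueued  = atomic_load(&queue->enqueued);
    s.processed = atomic_load(&queue->processed);
    s.dropped   = atomic_load(&queue->dropped);
    return s;
}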

Broader Applications

These async patterns apply broadly:

  • Web servers: Async I/O for handling many connections
  • Databases: Background checkpointing and compaction
  • Message queues: Separate ingest from processing
  • Network stacks: Interrupt handling vs packet processing

Whenever you have operations with vastly different latencies, consider decoupling them asynchronously.

Looking Forward

The async architecture has become foundational to FC-Redirect’s performance. As we continue scaling throughout 2013, the ability to handle slow path operations without blocking the fast path becomes increasingly valuable.

The pattern is so successful that I’m looking at applying it to other components of our storage networking stack. Anywhere we have mixed-latency operations, async processing can unlock significant performance gains.

Sometimes the best optimization isn’t making individual operations faster; it’s ensuring fast operations aren’t blocked by slow ones. That’s the power of asynchronous architecture.