Lock-free programming is often seen as dark magic: difficult, error-prone, and only for experts. After implementing several lock-free data structures in FC-Redirect, I’ve learned when they’re worth the complexity and how to implement them correctly. Here’s what worked for me.

Why Lock-Free?

Locks have problems at scale:

Contention: With 8 threads competing for a lock, the average wait time can exceed the time spent inside the critical section itself.

Priority Inversion: A low-priority thread holding the lock can block a high-priority thread.

Deadlock Risk: Multiple locks can deadlock if threads acquire them in inconsistent order.

Cache Line Bouncing: The lock variable ping-pongs between CPU caches as threads take turns acquiring it.

Lock-free data structures eliminate these issues by using atomic compare-and-swap (CAS) operations instead of locks.
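
To make the basic building block concrete, here’s a minimal sketch (not FC-Redirect code) of the CAS retry loop that underlies most lock-free algorithms: read the current value, compute a replacement, and try to install it atomically, retrying if another thread won the race.

#include <stdatomic.h>

// Minimal CAS retry loop: add 'delta' to a shared counter without a lock.
// Purely illustrative; a real counter would just use atomic_fetch_add.
static atomic_long counter;

void counter_add(long delta) {
    long expected = atomic_load_explicit(&counter, memory_order_relaxed);
    long desired;

    do {
        desired = expected + delta;
        // On failure, 'expected' is refreshed with the value another thread installed
    } while (!atomic_compare_exchange_weak_explicit(&counter, &expected, desired,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
}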

When to Use Lock-Free

Lock-free isn’t always better. Use it when:

  1. High contention: Multiple threads frequently accessing shared data
  2. Real-time requirements: Unbounded wait times are unacceptable
  3. Scalability critical: Need linear scaling across cores
  4. Simple operations: The data structure operations are straightforward

Don’t use lock-free when:

  1. Complex operations: Multi-step operations are hard to make lock-free
  2. Low contention: Locks are fine if rarely contested
  3. Development time constrained: Lock-free takes longer to implement correctly

Lock-Free Ring Buffer

The most useful lock-free structure I’ve implemented is a single-producer, single-consumer (SPSC) ring buffer:

#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct lockfree_ring_buffer {
    void **data;
    size_t capacity;

    // Keep head and tail on separate cache lines to avoid false sharing
    alignas(64) atomic_size_t head;  // Consumer owns
    alignas(64) atomic_size_t tail;  // Producer owns

    // Pad out the tail's cache line so whatever is allocated next to
    // the struct doesn't share it
    char padding[64 - sizeof(atomic_size_t)];
} lockfree_ring_buffer_t;

lockfree_ring_buffer_t* create_ring_buffer(size_t capacity) {
    // Capacity must be power of 2 for fast modulo
    assert((capacity & (capacity - 1)) == 0);

    lockfree_ring_buffer_t *rb = aligned_alloc(64, sizeof(*rb));

    rb->data = calloc(capacity, sizeof(void*));
    rb->capacity = capacity;

    atomic_store_explicit(&rb->head, 0, memory_order_relaxed);
    atomic_store_explicit(&rb->tail, 0, memory_order_relaxed);

    return rb;
}

Key design points:

  1. Cache line alignment: Head and tail are in separate cache lines to prevent false sharing.

  2. Power-of-2 capacity: Allows fast modulo using bitwise AND instead of division.

  3. Single producer/consumer: Simplifies memory ordering requirements.

Producer (Enqueue)

bool ring_buffer_enqueue(lockfree_ring_buffer_t *rb, void *item) {
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);

    // Check if full
    size_t next_tail = (tail + 1) & (rb->capacity - 1);
    if (next_tail == head) {
        return false;  // Full
    }

    // Write data
    rb->data[tail] = item;

    // Update tail (release semantics ensures data write is visible)
    atomic_store_explicit(&rb->tail, next_tail, memory_order_release);

    return true;
}

Memory ordering is critical:

  • acquire on head read: Ensures we see consumer’s updates
  • release on tail write: Ensures data write is visible before tail update
  • relaxed on tail read: We’re the only writer, no synchronization needed

Consumer (Dequeue)

void* ring_buffer_dequeue(lockfree_ring_buffer_t *rb) {
    size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);

    // Check if empty
    if (head == tail) {
        return NULL;  // Empty
    }

    // Read data
    void *item = rb->data[head];

    // Update head with release so the data read above completes before
    // the producer can observe the freed slot and overwrite it
    size_t next_head = (head + 1) & (rb->capacity - 1);
    atomic_store_explicit(&rb->head, next_head, memory_order_release);

    return item;
}

This SPSC ring buffer is incredibly fast: single-digit nanoseconds per operation on modern CPUs.

Multi-Producer, Single-Consumer Queue

For multiple producers, we need compare-and-swap:

typedef struct mpsc_node {
    void *data;
    atomic_uintptr_t next;
} mpsc_node_t;

typedef struct mpsc_queue {
    alignas(64) atomic_uintptr_t head;  // Consumer owns
    alignas(64) atomic_uintptr_t tail;  // Producers compete for

    mpsc_node_t stub;  // Dummy node to simplify implementation
} mpsc_queue_t;

void mpsc_queue_init(mpsc_queue_t *q) {
    q->stub.data = NULL;
    atomic_store(&q->stub.next, 0);

    atomic_store(&q->head, (uintptr_t)&q->stub);
    atomic_store(&q->tail, (uintptr_t)&q->stub);
}

Enqueue (Multiple Producers)

void mpsc_queue_enqueue(mpsc_queue_t *q, void *data) {
    // Allocate node
    mpsc_node_t *node = malloc(sizeof(mpsc_node_t));
    node->data = data;
    atomic_store(&node->next, 0);

    // Atomically swap tail
    mpsc_node_t *prev = (mpsc_node_t*)atomic_exchange(&q->tail,
                                                       (uintptr_t)node);

    // Link previous tail to new node
    atomic_store(&prev->next, (uintptr_t)node);
}

The atomic exchange on tail ensures only one producer claims the tail position at a time. Note that there is a brief window between the exchange and the store to prev->next during which the list is not yet linked; the consumer handles this by treating a NULL next as “empty for now” and retrying later.

Dequeue (Single Consumer)

void* mpsc_queue_dequeue(mpsc_queue_t *q) {
    mpsc_node_t *head = (mpsc_node_t*)atomic_load(&q->head);
    mpsc_node_t *next = (mpsc_node_t*)atomic_load(&head->next);

    if (next == NULL) {
        return NULL;  // Empty
    }

    void *data = next->data;

    atomic_store(&q->head, (uintptr_t)next);

    // Free old head (was stub or previous node)
    if (head != &q->stub) {
        free(head);
    }

    return data;
}
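
Here’s a sketch of how this queue might be wired up, with a hypothetical integer “event” payload (the names are illustrative, not from FC-Redirect): any number of threads call report_event(), exactly one thread calls drain_events(), and the consumer frees each payload after handling it.

#include <stdio.h>
#include <stdlib.h>

static mpsc_queue_t event_queue;   // call mpsc_queue_init(&event_queue) once at startup

// Producers: safe to call from any number of threads concurrently
void report_event(int value) {
    int *ev = malloc(sizeof(*ev));
    *ev = value;
    mpsc_queue_enqueue(&event_queue, ev);
}

// Consumer: must be called from exactly one thread
void drain_events(void) {
    void *item;
    while ((item = mpsc_queue_dequeue(&event_queue)) != NULL) {
        printf("event: %d\n", *(int*)item);
        free(item);  // frees the payload; the queue node is freed inside dequeue
    }
}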

Lock-Free Hash Table

For our flow lookup, I implemented a lock-free hash table using open addressing:

typedef struct lf_hash_entry {
    _Atomic uint64_t key;
    _Atomic uint64_t value;
    _Atomic uint32_t version;  // Reserved for versioning/ABA schemes (discussed below)
} lf_hash_entry_t;

typedef struct lf_hash_table {
    lf_hash_entry_t *entries;
    size_t capacity;
    size_t mask;  // capacity - 1

    atomic_size_t count;
} lf_hash_table_t;

#define EMPTY_KEY 0
#define DELETED_KEY ((uint64_t)-1)
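
The lookup and insert code that follows assumes a hash_key() function and an already-allocated table, neither of which I’ve shown. Here’s one plausible way to fill those in, as a sketch rather than the exact FC-Redirect code (the lf_hash_create name is mine): a 64-bit mix based on the splitmix64 finalizer, and a zero-initialized table, where zeroed entries conveniently start out as EMPTY_KEY.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// One possible hash_key(): the splitmix64 finalizer, which mixes all 64 bits
static uint64_t hash_key(uint64_t key) {
    uint64_t x = key;
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Allocate a table; capacity must be a power of two so that mask = capacity - 1
lf_hash_table_t* lf_hash_create(size_t capacity) {
    assert((capacity & (capacity - 1)) == 0);

    lf_hash_table_t *table = malloc(sizeof(*table));
    if (table == NULL) return NULL;

    // All-zero entries mean every key starts as EMPTY_KEY (0)
    table->entries = calloc(capacity, sizeof(lf_hash_entry_t));
    table->capacity = capacity;
    table->mask = capacity - 1;
    atomic_store(&table->count, 0);

    return table;
}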

Lock-Free Lookup

uint64_t lf_hash_lookup(lf_hash_table_t *table, uint64_t key) {
    assert(key != EMPTY_KEY && key != DELETED_KEY);

    uint64_t hash = hash_key(key);
    size_t index = hash & table->mask;

    // Linear probing
    for (size_t i = 0; i < table->capacity; i++) {
        lf_hash_entry_t *entry = &table->entries[index];

        uint64_t entry_key = atomic_load_explicit(&entry->key,
                                                  memory_order_acquire);

        if (entry_key == key) {
            // Found it
            return atomic_load_explicit(&entry->value,
                                       memory_order_acquire);
        }

        if (entry_key == EMPTY_KEY) {
            // Not found
            return 0;
        }

        // Collision, try next slot
        index = (index + 1) & table->mask;
    }

    // Table full
    return 0;
}

Lock-Free Insert

bool lf_hash_insert(lf_hash_table_t *table, uint64_t key, uint64_t value) {
    assert(key != EMPTY_KEY && key != DELETED_KEY);
    assert(value != 0);

    uint64_t hash = hash_key(key);
    size_t index = hash & table->mask;

    for (size_t i = 0; i < table->capacity; i++) {
        lf_hash_entry_t *entry = &table->entries[index];

        uint64_t entry_key = atomic_load_explicit(&entry->key,
                                                  memory_order_acquire);

        if (entry_key == key) {
            // Key exists, update value
            atomic_store_explicit(&entry->value, value,
                                 memory_order_release);
            return true;
        }

        if (entry_key == EMPTY_KEY || entry_key == DELETED_KEY) {
            // Try to claim this slot
            uint64_t expected = entry_key;

            if (atomic_compare_exchange_strong(&entry->key, &expected, key)) {
                // Successfully claimed, write value
                atomic_store_explicit(&entry->value, value,
                                     memory_order_release);

                atomic_fetch_add(&table->count, 1);
                return true;
            }

            // CAS failed, someone else claimed it, retry
            i--;  // Try this slot again
            continue;
        }

        // Collision, try next slot
        index = (index + 1) & table->mask;
    }

    // Table full
    return false;
}

The CAS ensures only one thread successfully claims an empty slot. Note that a concurrent reader can see the key before the value store lands; because value 0 is reserved to mean “not found”, such a reader simply treats the entry as absent until the value appears.

Lock-Free Delete

Deletion is tricky in lock-free open-addressed hash tables:

bool lf_hash_delete(lf_hash_table_t *table, uint64_t key) {
    uint64_t hash = hash_key(key);
    size_t index = hash & table->mask;

    for (size_t i = 0; i < table->capacity; i++) {
        lf_hash_entry_t *entry = &table->entries[index];

        uint64_t entry_key = atomic_load(&entry->key);

        if (entry_key == key) {
            // Mark as deleted
            atomic_store(&entry->key, DELETED_KEY);
            atomic_store(&entry->value, 0);

            atomic_fetch_sub(&table->count, 1);
            return true;
        }

        if (entry_key == EMPTY_KEY) {
            // Not found
            return false;
        }

        index = (index + 1) & table->mask;
    }

    return false;
}

We mark slots as DELETED rather than EMPTY to preserve probe chains: if the slot went back to EMPTY, a later lookup probing for a key stored beyond it would stop at the empty slot and miss that key.

The ABA Problem

Lock-free data structures are vulnerable to the ABA problem:

Thread 1:                    Thread 2:
Read value A
                            Change A to B
                            Change B back to A
CAS(A, new_value)
  succeeds! (but A isn't the same A)

Solution: Use versioned pointers or generation counters:

typedef struct versioned_ptr {
    void *ptr;
    uint64_t version;
} versioned_ptr_t;

// Pack {pointer, version} into a 128-bit value for double-wide CAS
// (on x86-64 this needs cmpxchg16b; compile with -mcx16)
typedef _Atomic __int128 versioned_ptr_atomic_t;

static __int128 pack_versioned_ptr(versioned_ptr_t vp) {
    return (__int128)(((unsigned __int128)vp.version << 64) |
                      (uintptr_t)vp.ptr);
}

static versioned_ptr_t unpack_versioned_ptr(__int128 packed) {
    versioned_ptr_t vp = {
        .ptr     = (void*)(uintptr_t)packed,
        .version = (uint64_t)((unsigned __int128)packed >> 64)
    };
    return vp;
}

bool versioned_cas(versioned_ptr_atomic_t *target,
                   versioned_ptr_t expected,
                   versioned_ptr_t desired) {
    __int128 expected_packed = pack_versioned_ptr(expected);
    __int128 desired_packed  = pack_versioned_ptr(desired);

    return atomic_compare_exchange_strong(target,
                                          &expected_packed,
                                          desired_packed);
}

The version counter prevents the ABA problem by changing even when the pointer doesn’t.
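
To show where the version counter earns its keep, here’s a sketch of popping from a Treiber-style lock-free stack whose head is a versioned pointer, reusing the pack/unpack helpers above (the stack types are illustrative, not FC-Redirect code). Without the version bump, the classic ABA interleaving lets the CAS succeed against a head node that was popped, freed, and re-pushed in the meantime. Safe node reclamation is a separate problem and is deliberately ignored here.

typedef struct stack_node {
    struct stack_node *next;
    void *data;
} stack_node_t;

typedef struct versioned_stack {
    versioned_ptr_atomic_t head;   // packed {stack_node_t*, version}
} versioned_stack_t;

void* versioned_stack_pop(versioned_stack_t *s) {
    for (;;) {
        versioned_ptr_t old_head = unpack_versioned_ptr(atomic_load(&s->head));
        stack_node_t *top = old_head.ptr;

        if (top == NULL) {
            return NULL;  // Empty
        }

        versioned_ptr_t new_head = {
            .ptr     = top->next,              // assumes 'top' hasn't been reclaimed yet
            .version = old_head.version + 1    // the bump is what defeats ABA
        };

        if (versioned_cas(&s->head, old_head, new_head)) {
            return top->data;  // caller still needs a safe way to reclaim 'top'
        }
        // CAS failed: head (or its version) changed under us; retry
    }
}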

Memory Ordering: The Subtle Killer

Memory ordering is where lock-free programming gets dangerous. Here’s my mental model:

Relaxed

atomic_store_explicit(&x, value, memory_order_relaxed);

No synchronization. Only guarantees atomicity, not ordering. Use when:

  • Single variable
  • No dependencies on other variables
  • Performance critical
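
The textbook case is a statistics counter that many threads bump and a reporting thread samples now and then: only the count matters, nothing else is published through it, so relaxed ordering is enough. A sketch (not FC-Redirect’s actual stats code):

#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast64_t packets_processed;

// Hot path: many threads increment; only atomicity is needed
void stats_count_packet(void) {
    atomic_fetch_add_explicit(&packets_processed, 1, memory_order_relaxed);
}

// Reporting thread: an approximate, slightly stale snapshot is fine
uint64_t stats_read_packets(void) {
    return (uint64_t)atomic_load_explicit(&packets_processed, memory_order_relaxed);
}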

Acquire/Release

// Thread 1
atomic_store_explicit(&data, value, memory_order_release);

// Thread 2
value = atomic_load_explicit(&data, memory_order_acquire);

Forms a synchronization point. All writes before release are visible after acquire. Use for:

  • Producer/consumer communication
  • Publishing data
  • Most lock-free structures
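
The snippet above synchronizes on a single variable; the guarantee really pays off when a plain, non-atomic payload is published through an atomic flag. A minimal sketch of that pattern:

#include <stdatomic.h>
#include <stdbool.h>

static int payload;            // plain, non-atomic data
static atomic_bool ready;      // synchronization flag

// Thread 1: write the payload, then publish it
void publish(int value) {
    payload = value;                                           // (1) plain write
    atomic_store_explicit(&ready, true, memory_order_release); // (2) publish
}

// Thread 2: if the acquire load sees 'true', the payload write is visible
bool try_consume(int *out) {
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;   // guaranteed to see the value written in (1)
        return true;
    }
    return false;
}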

Sequential Consistency

atomic_store(&x, value);  // Default is seq_cst

Strongest ordering. Global order of all seq_cst operations. Use when:

  • Correctness is unclear with weaker ordering
  • Performance isn’t critical
  • During development (optimize later)
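
The classic case where acquire/release isn’t enough is the store-buffering pattern: each thread sets its own flag and then checks the other’s. Under release/acquire alone, both threads can read 0 and both proceed; sequential consistency rules that outcome out. A sketch:

#include <stdatomic.h>

static atomic_int flag0, flag1;

void thread0(void) {
    atomic_store(&flag0, 1);           // seq_cst store
    if (atomic_load(&flag1) == 0) {    // seq_cst load
        // With seq_cst, at most one thread can reach its "flag is 0" branch
    }
}

void thread1(void) {
    atomic_store(&flag1, 1);
    if (atomic_load(&flag0) == 0) {
        // (same reasoning, mirrored)
    }
}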

Testing Lock-Free Code

Lock-free bugs are Heisenbugs: they tend to disappear the moment you add logging or attach a debugger. My testing strategy:

Stress Test

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

// Shared test state at file scope so the thread functions can see it
static atomic_bool running;
static atomic_uint_fast64_t enqueued;
static atomic_uint_fast64_t dequeued;

static void* producer_thread(void *arg) {
    lockfree_ring_buffer_t *rb = arg;

    while (atomic_load(&running)) {
        if (ring_buffer_enqueue(rb, (void*)(uintptr_t)123)) {
            atomic_fetch_add(&enqueued, 1);
        }
    }
    return NULL;
}

static void* consumer_thread(void *arg) {
    lockfree_ring_buffer_t *rb = arg;

    while (atomic_load(&running)) {
        if (ring_buffer_dequeue(rb) != NULL) {
            atomic_fetch_add(&dequeued, 1);
        }
    }
    return NULL;
}

void stress_test_ring_buffer(void) {
    lockfree_ring_buffer_t *rb = create_ring_buffer(1024);

    atomic_store(&running, true);

    pthread_t producer, consumer;
    pthread_create(&producer, NULL, producer_thread, rb);
    pthread_create(&consumer, NULL, consumer_thread, rb);

    // Run for 60 seconds
    sleep(60);
    atomic_store(&running, false);

    pthread_join(producer, NULL);
    pthread_join(consumer, NULL);

    // Drain anything still sitting in the buffer
    while (ring_buffer_dequeue(rb) != NULL) {
        atomic_fetch_add(&dequeued, 1);
    }

    // Verify no items were lost or duplicated
    assert(atomic_load(&enqueued) == atomic_load(&dequeued));
}

Thread Sanitizer

Clang’s ThreadSanitizer (TSan) detects data races:

clang -fsanitize=thread -g -O2 test.c -o test
./test

It caught several subtle bugs in my initial implementations.

Formal Verification

For critical structures, I use model checking:

#include <stddef.h>

// Simplified single-step model for the CBMC model checker.
// Functions declared without a body are treated by CBMC as returning
// nondeterministic values.
_Bool  nondet_bool(void);
void  *nondet_ptr(void);
size_t nondet_size_t(void);

void model_ring_buffer() {
    size_t head = nondet_size_t();
    size_t tail = nondet_size_t();
    void *data[4];  // Small size keeps model checking tractable

    // Constrain the indices to valid slots
    __CPROVER_assume(head < 4 && tail < 4);

    // Nondeterministic choice of operation
    if (nondet_bool()) {
        // Enqueue
        size_t next_tail = (tail + 1) & 3;
        __CPROVER_assume(next_tail != head);  // Assume not full

        data[tail] = nondet_ptr();
        tail = next_tail;
    } else {
        // Dequeue
        __CPROVER_assume(head != tail);  // Assume not empty

        void *item = data[head];
        __CPROVER_assert(item != NULL, "dequeued item should be valid");

        head = (head + 1) & 3;
    }
}
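
Running the checker is then a single command; assuming the model lives in model.c, something along the lines of:

cbmc model.c --function model_ring_buffer --bounds-check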

Performance Results

Lock-free structures delivered significant improvements in FC-Redirect:

Ring Buffer (SPSC):

  • Lock-based: 45ns per operation
  • Lock-free: 8ns per operation
  • 5.6x faster

Queue (MPSC):

  • Lock-based: 120ns per operation
  • Lock-free: 35ns per operation
  • 3.4x faster

Hash Table:

  • Lock-based: 180ns lookup
  • Lock-free: 95ns lookup
  • 1.9x faster

System-wide impact:

  • Overall throughput: +18%
  • CPU utilization: -22%
  • Tail latency: -35%

When Locks Are Better

Lock-free isn’t always faster. Locks are better when:

  1. Complex critical sections: Multiple steps, error handling
  2. Low contention: Uncontested locks are very fast
  3. Need to wait: Sometimes blocking is the right behavior
  4. Simplicity matters: Locks are easier to reason about

Example where I reverted to locks:

// Complex flow table resize operation
void resize_flow_table(flow_table_t *table) {
    pthread_mutex_lock(&table->resize_lock);

    // Multi-step operation:
    // 1. Allocate new table
    // 2. Rehash all entries
    // 3. Update pointers
    // 4. Free old table

    // This would be a nightmare to make lock-free.
    // A lock is fine here since resizes are rare.

    pthread_mutex_unlock(&table->resize_lock);
}

Lessons Learned

After months of lock-free programming:

  1. Start with locks: Only go lock-free when profiling shows contention.

  2. Use proven algorithms: Don’t invent your own. Use well-known algorithms.

  3. Test thoroughly: Lock-free bugs are subtle and rare. Stress test everything.

  4. Document memory ordering: Future you will forget why you used acquire/release.

  5. Keep it simple: The simplest lock-free design is usually best.

  6. Know when to stop: Some problems aren’t worth making lock-free.

Resources

Key resources that helped me:

  • “The Art of Multiprocessor Programming” by Herlihy & Shavit
  • “C++ Concurrency in Action” by Anthony Williams
  • Linux kernel lock-free structures
  • 1024cores.net (Dmitry Vyukov’s blog)
  • Preshing on Programming blog

Conclusion

Lock-free data structures are powerful tools, but they’re not magic. They solve specific problems (contention, priority inversion, real-time requirements) at the cost of complexity.

The key is knowing when to use them. Profile first, identify contention, then apply lock-free techniques selectively. Don’t make your entire codebase lock-free; make the critical paths lock-free.

In FC-Redirect, lock-free structures improved performance significantly in high-contention areas (packet processing, statistics collection). But most of our code still uses locks, and that’s fine.

Choose the right tool for the job. Sometimes that’s locks, sometimes it’s lock-free, sometimes it’s neither. Engineering is about tradeoffs.

Lock-free programming is a valuable skill, but wisdom is knowing when not to use it.