Lock-free programming is often seen as dark magic: difficult, error-prone, and only for experts. After implementing several lock-free data structures in FC-Redirect, I’ve learned when they’re worth the complexity and how to implement them correctly. Let me share what I’ve learned.
Why Lock-Free?
Locks have problems at scale:
- Contention: With 8 threads competing for a lock, the average wait time can exceed the critical section itself.
- Priority inversion: A low-priority thread holding the lock can block a high-priority thread.
- Deadlock risk: Multiple locks can deadlock if acquired in the wrong order.
- Cache line bouncing: The lock variable ping-pongs between CPU caches.
Lock-free data structures eliminate these issues by using atomic compare-and-swap (CAS) operations instead of locks.
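To make that concrete, here is a minimal sketch of the CAS retry loop that everything below builds on. The saturating-add helper is purely illustrative (a plain fetch_add can't express saturation, which is exactly why it needs the loop):

#include <stdatomic.h>
#include <stdint.h>

// Minimal sketch: add a delta to a shared counter without a lock,
// capping it at 'limit'. compare_exchange retries until no other
// thread raced with us between the load and the update.
static void atomic_add_saturating(_Atomic uint64_t *counter,
                                  uint64_t delta, uint64_t limit) {
    uint64_t old = atomic_load_explicit(counter, memory_order_relaxed);
    uint64_t new_val;
    do {
        new_val = old + delta;
        if (new_val > limit) {
            new_val = limit;  // Saturate instead of exceeding the limit
        }
        // On failure, 'old' is refreshed with the current value and we retry
    } while (!atomic_compare_exchange_weak_explicit(
                 counter, &old, new_val,
                 memory_order_relaxed, memory_order_relaxed));
}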
When to Use Lock-Free
Lock-free isn’t always better. Use it when:
- High contention: Multiple threads frequently accessing shared data
- Real-time requirements: Unbounded wait times are unacceptable
- Scalability critical: Need linear scaling across cores
- Simple operations: The data structure operations are straightforward
Don’t use lock-free when:
- Complex operations: Multi-step operations are hard to make lock-free
- Low contention: Locks are fine if rarely contested
- Development time constrained: Lock-free takes longer to implement correctly
Lock-Free Ring Buffer
The most useful lock-free structure I’ve implemented is a single-producer, single-consumer (SPSC) ring buffer:
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct lockfree_ring_buffer {
    void **data;
    size_t capacity;
    // Separate cache lines to avoid false sharing
    alignas(64) atomic_size_t head;  // Consumer owns
    alignas(64) atomic_size_t tail;  // Producer owns
    // Padding so the struct ends on a cache-line boundary
    char padding[64 - sizeof(atomic_size_t)];
} lockfree_ring_buffer_t;

lockfree_ring_buffer_t* create_ring_buffer(size_t capacity) {
    // Capacity must be a power of 2 for fast modulo
    assert(capacity > 0 && (capacity & (capacity - 1)) == 0);
    lockfree_ring_buffer_t *rb = aligned_alloc(64, sizeof(*rb));
    rb->data = calloc(capacity, sizeof(void*));
    rb->capacity = capacity;
    atomic_store_explicit(&rb->head, 0, memory_order_relaxed);
    atomic_store_explicit(&rb->tail, 0, memory_order_relaxed);
    return rb;
}
Key design points:
- Cache line alignment: Head and tail are in separate cache lines to prevent false sharing.
- Power-of-2 capacity: Allows fast modulo using bitwise AND instead of division (see the helper sketch after this list).
- Single producer/consumer: Simplifies memory ordering requirements.
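To make the masked-index arithmetic concrete, here are a few small, hypothetical helpers (not part of the FC-Redirect API) that compute occupancy from the raw head and tail. Note that one slot is always left unused so a full buffer is distinguishable from an empty one:

// Sketch only: values are snapshots and may be stale under concurrency.
static inline size_t ring_buffer_count(lockfree_ring_buffer_t *rb) {
    size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    // Indices are stored pre-masked, so unsigned wraparound plus the mask
    // gives the number of occupied slots.
    return (tail - head) & (rb->capacity - 1);
}

static inline bool ring_buffer_is_empty(lockfree_ring_buffer_t *rb) {
    return ring_buffer_count(rb) == 0;
}

static inline bool ring_buffer_is_full(lockfree_ring_buffer_t *rb) {
    // One slot is deliberately sacrificed so that full != empty
    return ring_buffer_count(rb) == rb->capacity - 1;
}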
Producer (Enqueue)
bool ring_buffer_enqueue(lockfree_ring_buffer_t *rb, void *item) {
size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
// Check if full
size_t next_tail = (tail + 1) & (rb->capacity - 1);
if (next_tail == head) {
return false; // Full
}
// Write data
rb->data[tail] = item;
// Update tail (release semantics ensures data write is visible)
atomic_store_explicit(&rb->tail, next_tail, memory_order_release);
return true;
}
Memory ordering is critical:
- acquire on head read: Ensures we see consumer’s updates
- release on tail write: Ensures data write is visible before tail update
- relaxed on tail read: We’re the only writer, no synchronization needed
Consumer (Dequeue)
void* ring_buffer_dequeue(lockfree_ring_buffer_t *rb) {
size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
// Check if empty
if (head == tail) {
return NULL; // Empty
}
// Read data
void *item = rb->data[head];
// Update head (release ensures the data read above is ordered before
// the slot is handed back to the producer for reuse)
size_t next_head = (head + 1) & (rb->capacity - 1);
atomic_store_explicit(&rb->head, next_head, memory_order_release);
return item;
}
This SPSC ring buffer is incredibly fast: single-digit nanoseconds per operation on modern CPUs.
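Your numbers will vary with hardware and workload. If you want a rough sense of the order of magnitude on your own machine, a crude single-threaded loop like the following is enough; this harness is hypothetical and admittedly unscientific, since it excludes the cross-core cache traffic where the real cost shows up:

#include <stdio.h>
#include <time.h>

// Rough timing loop: ping-pong one item through the buffer.
void bench_ring_buffer(void) {
    lockfree_ring_buffer_t *rb = create_ring_buffer(1024);
    const long iterations = 10 * 1000 * 1000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        ring_buffer_enqueue(rb, (void *)(uintptr_t)(i + 1));
        ring_buffer_dequeue(rb);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per enqueue+dequeue pair\n", ns / iterations);
}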
Multi-Producer, Single-Consumer Queue
For multiple producers, we need atomic read-modify-write operations on the shared tail:
typedef struct mpsc_node {
void *data;
atomic_uintptr_t next;
} mpsc_node_t;
typedef struct mpsc_queue {
alignas(64) atomic_uintptr_t head; // Consumer owns
alignas(64) atomic_uintptr_t tail; // Producers compete for
mpsc_node_t stub; // Dummy node to simplify implementation
} mpsc_queue_t;
void mpsc_queue_init(mpsc_queue_t *q) {
q->stub.data = NULL;
atomic_store(&q->stub.next, 0);
atomic_store(&q->head, (uintptr_t)&q->stub);
atomic_store(&q->tail, (uintptr_t)&q->stub);
}
Enqueue (Multiple Producers)
void mpsc_queue_enqueue(mpsc_queue_t *q, void *data) {
// Allocate node
mpsc_node_t *node = malloc(sizeof(mpsc_node_t));
node->data = data;
atomic_store(&node->next, 0);
// Atomically swap tail
mpsc_node_t *prev = (mpsc_node_t*)atomic_exchange(&q->tail,
(uintptr_t)node);
// Link previous tail to new node
atomic_store(&prev->next, (uintptr_t)node);
}
The atomic exchange on tail ensures only one producer successfully claims the tail position at a time.
Dequeue (Single Consumer)
void* mpsc_queue_dequeue(mpsc_queue_t *q) {
mpsc_node_t *head = (mpsc_node_t*)atomic_load(&q->head);
mpsc_node_t *next = (mpsc_node_t*)atomic_load(&head->next);
if (next == NULL) {
return NULL; // Empty
}
void *data = next->data;
atomic_store(&q->head, (uintptr_t)next);
// Free old head (was stub or previous node)
if (head != &q->stub) {
free(head);
}
return data;
}
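One detail the snippets above leave out is cleanup: every enqueue mallocs a node, so something has to free whatever is left at shutdown. A sketch of a teardown routine, assuming all producers and the consumer have already stopped:

// Hypothetical teardown: only safe once every producer and the consumer
// have stopped. Any data still attached to queued nodes is the caller's
// responsibility to release.
void mpsc_queue_destroy(mpsc_queue_t *q) {
    mpsc_node_t *node = (mpsc_node_t*)atomic_load(&q->head);
    while (node != NULL) {
        mpsc_node_t *next = (mpsc_node_t*)atomic_load(&node->next);
        if (node != &q->stub) {
            free(node);  // The stub lives inside the queue struct itself
        }
        node = next;
    }
}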
Lock-Free Hash Table
For our flow lookup, I implemented a lock-free hash table using open addressing:
typedef struct lf_hash_entry {
    _Atomic uint64_t key;
    _Atomic uint64_t value;
    _Atomic uint32_t version; // For the ABA problem
} lf_hash_entry_t;
typedef struct lf_hash_table {
lf_hash_entry_t *entries;
size_t capacity;
size_t mask; // capacity - 1
atomic_size_t count;
} lf_hash_table_t;
#define EMPTY_KEY 0
#define DELETED_KEY ((uint64_t)-1)
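The snippets below assume the entries start zeroed (so every slot begins as EMPTY_KEY) and that hash_key() is a decent 64-bit mixer. For completeness, here is one way those could look; treat it as a sketch rather than the exact FC-Redirect code:

// Hypothetical constructor and hash mixer; any good 64-bit mixer works.
static uint64_t hash_key(uint64_t key) {
    // splitmix64-style finalizer
    key ^= key >> 30; key *= 0xbf58476d1ce4e5b9ULL;
    key ^= key >> 27; key *= 0x94d049bb133111ebULL;
    key ^= key >> 31;
    return key;
}

lf_hash_table_t* lf_hash_create(size_t capacity) {
    assert(capacity > 0 && (capacity & (capacity - 1)) == 0); // Power of 2
    lf_hash_table_t *table = malloc(sizeof(*table));
    // calloc zeroes the entries, so every key starts as EMPTY_KEY (0)
    table->entries = calloc(capacity, sizeof(lf_hash_entry_t));
    table->capacity = capacity;
    table->mask = capacity - 1;
    atomic_store(&table->count, 0);
    return table;
}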
Lock-Free Lookup
uint64_t lf_hash_lookup(lf_hash_table_t *table, uint64_t key) {
assert(key != EMPTY_KEY && key != DELETED_KEY);
uint64_t hash = hash_key(key);
size_t index = hash & table->mask;
// Linear probing
for (size_t i = 0; i < table->capacity; i++) {
lf_hash_entry_t *entry = &table->entries[index];
uint64_t entry_key = atomic_load_explicit(&entry->key,
memory_order_acquire);
if (entry_key == key) {
// Found it
return atomic_load_explicit(&entry->value,
memory_order_acquire);
}
if (entry_key == EMPTY_KEY) {
// Not found
return 0;
}
// Collision, try next slot
index = (index + 1) & table->mask;
}
// Table full
return 0;
}
Lock-Free Insert
bool lf_hash_insert(lf_hash_table_t *table, uint64_t key, uint64_t value) {
assert(key != EMPTY_KEY && key != DELETED_KEY);
assert(value != 0);
uint64_t hash = hash_key(key);
size_t index = hash & table->mask;
for (size_t i = 0; i < table->capacity; i++) {
lf_hash_entry_t *entry = &table->entries[index];
uint64_t entry_key = atomic_load_explicit(&entry->key,
memory_order_acquire);
if (entry_key == key) {
// Key exists, update value
atomic_store_explicit(&entry->value, value,
memory_order_release);
return true;
}
if (entry_key == EMPTY_KEY || entry_key == DELETED_KEY) {
// Try to claim this slot
uint64_t expected = entry_key;
if (atomic_compare_exchange_strong(&entry->key, &expected, key)) {
// Successfully claimed, write value
atomic_store_explicit(&entry->value, value,
memory_order_release);
atomic_fetch_add(&table->count, 1);
return true;
}
// CAS failed, someone else claimed it, retry
i--; // Try this slot again
continue;
}
// Collision, try next slot
index = (index + 1) & table->mask;
}
// Table full
return false;
}
The CAS ensures only one thread successfully claims an empty slot. A concurrent reader may briefly see the claimed key before the value store completes; it then reads 0, which this table treats as "not found", so a half-finished insert is simply not visible yet.
Lock-Free Delete
Deletion is tricky in lock-free open-addressed hash tables:
bool lf_hash_delete(lf_hash_table_t *table, uint64_t key) {
uint64_t hash = hash_key(key);
size_t index = hash & table->mask;
for (size_t i = 0; i < table->capacity; i++) {
lf_hash_entry_t *entry = &table->entries[index];
uint64_t entry_key = atomic_load(&entry->key);
if (entry_key == key) {
// Mark as deleted
atomic_store(&entry->key, DELETED_KEY);
atomic_store(&entry->value, 0);
atomic_fetch_sub(&table->count, 1);
return true;
}
if (entry_key == EMPTY_KEY) {
// Not found
return false;
}
index = (index + 1) & table->mask;
}
return false;
}
We mark slots as DELETED rather than EMPTY to preserve probe chains.
The ABA Problem
Lock-free data structures are vulnerable to the ABA problem:
Thread 1                          Thread 2
--------                          --------
Read value A
                                  Change A to B
                                  Change B back to A
CAS(A, new_value) succeeds!
(but this A isn't the same A)
Solution: Use versioned pointers or generation counters:
typedef struct versioned_ptr {
    void *ptr;
    uint64_t version;
} versioned_ptr_t;

// Pack into a 128-bit value for double-wide CAS (cmpxchg16b on x86-64;
// may need -mcx16 and/or libatomic depending on the toolchain)
typedef _Atomic __int128 versioned_ptr_atomic_t;

static __int128 pack_versioned_ptr(versioned_ptr_t vp) {
    return ((__int128)vp.version << 64) | (uintptr_t)vp.ptr;
}

bool versioned_cas(versioned_ptr_atomic_t *target,
                   versioned_ptr_t expected,
                   versioned_ptr_t desired) {
    __int128 expected_packed = pack_versioned_ptr(expected);
    __int128 desired_packed = pack_versioned_ptr(desired);
    return atomic_compare_exchange_strong(target,
                                          &expected_packed,
                                          desired_packed);
}
The version counter prevents the ABA problem by changing even when the pointer doesn’t.
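Where this bites in practice is a structure like a lock-free stack: pop reads the head A and its next pointer, another thread pops A, frees it, and pushes a recycled node at the same address, and the stale CAS still "succeeds". With a version-counted head it fails instead. Here is a sketch of such a pop, assuming a simple node type and the 128-bit packing above:

typedef struct stack_node {
    void *data;
    struct stack_node *next;
} stack_node_t;

// Hypothetical pop using a version-counted head to defeat ABA.
// 'head' holds {top pointer, version} packed into 128 bits.
void* stack_pop(versioned_ptr_atomic_t *head) {
    for (;;) {
        __int128 packed = atomic_load(head);
        stack_node_t *top = (stack_node_t *)(uintptr_t)packed; // low 64 bits
        uint64_t version = (uint64_t)(packed >> 64);           // high 64 bits
        if (top == NULL) {
            return NULL;  // Stack is empty
        }
        versioned_ptr_t expected = { top, version };
        versioned_ptr_t desired  = { top->next, version + 1 };
        // Even if another thread pops and pushes 'top' back in the meantime,
        // the version has moved on and this CAS fails as it should.
        if (versioned_cas(head, expected, desired)) {
            return top->data;
        }
    }
}

The version counter only protects the CAS itself; dereferencing top->next still assumes the node hasn't been freed underneath you, which is why real implementations pair this with node pooling, hazard pointers, or epoch-based reclamation.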
Memory Ordering: The Subtle Killer
Memory ordering is where lock-free programming gets dangerous. Here’s my mental model:
Relaxed
atomic_store_explicit(&x, value, memory_order_relaxed);
No synchronization. Only guarantees atomicity, not ordering. Use when:
- Single variable
- No dependencies on other variables
- Performance critical
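A typical relaxed use case is a statistics counter that is only ever read for reporting; a minimal sketch:

#include <stdatomic.h>
#include <stdint.h>

// Hypothetical per-module packet counter: each increment only needs to be
// atomic, it doesn't need to order any other memory accesses around it.
static _Atomic uint64_t packets_processed;

void count_packet(void) {
    atomic_fetch_add_explicit(&packets_processed, 1, memory_order_relaxed);
}

uint64_t read_packet_count(void) {
    // May lag slightly behind other threads, which is fine for statistics
    return atomic_load_explicit(&packets_processed, memory_order_relaxed);
}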
Acquire/Release
// Thread 1
atomic_store_explicit(&data, value, memory_order_release);
// Thread 2
value = atomic_load_explicit(&data, memory_order_acquire);
Forms a synchronization point. All writes before release are visible after acquire. Use for:
- Producer/consumer communication
- Publishing data
- Most lock-free structures
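The canonical acquire/release pattern is publishing a fully built object behind a flag; a minimal one-shot sketch:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct message {
    int id;
    char payload[64];
} message_t;

static message_t shared_msg;
static atomic_bool msg_ready;  // One-shot: never reset in this sketch

// Producer: fill in the message, then publish it with a release store.
void publish_message(int id) {
    shared_msg.id = id;          // Plain writes, ordered by the release below
    shared_msg.payload[0] = '\0';
    atomic_store_explicit(&msg_ready, true, memory_order_release);
}

// Consumer: an acquire load that observes 'true' is guaranteed to also see
// every write the producer made before the release store.
bool try_read_message(message_t *out) {
    if (!atomic_load_explicit(&msg_ready, memory_order_acquire)) {
        return false;
    }
    *out = shared_msg;
    return true;
}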
Sequential Consistency
atomic_store(&x, value); // Default is seq_cst
Strongest ordering. Global order of all seq_cst operations. Use when:
- Correctness is unclear with weaker ordering
- Performance isn’t critical
- During development (optimize later)
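The case acquire/release can't handle is the store-then-load pattern, for example two threads that each set their own flag and then check the other's (Dekker-style mutual exclusion). A sketch of why seq_cst matters there:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag_a, flag_b;

void thread_a(void) {
    atomic_store(&flag_a, true);     // seq_cst by default
    if (!atomic_load(&flag_b)) {
        // Thread B has not announced itself yet.
    }
}

void thread_b(void) {
    atomic_store(&flag_b, true);
    if (!atomic_load(&flag_a)) {
        // Thread A has not announced itself yet.
    }
}

// Under seq_cst, at most one of the two loads can return false, because all
// four operations fit into a single global order. With release stores and
// acquire loads, *both* loads could return false, which breaks
// Dekker/Peterson-style mutual exclusion.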
Testing Lock-Free Code
Lock-free bugs are Heisenbugs: they tend to disappear the moment you add logging or attach a debugger, because anything that changes timing changes the interleavings you see. My testing strategy:
Stress Test
#include <pthread.h>
#include <unistd.h>

typedef struct stress_ctx {
    lockfree_ring_buffer_t *rb;
    atomic_bool running;
    atomic_uint_fast64_t enqueued;
    atomic_uint_fast64_t dequeued;
} stress_ctx_t;

static void* producer_thread(void *arg) {
    stress_ctx_t *ctx = arg;
    while (atomic_load(&ctx->running)) {
        if (ring_buffer_enqueue(ctx->rb, (void*)(uintptr_t)123)) {
            atomic_fetch_add(&ctx->enqueued, 1);
        }
    }
    return NULL;
}

static void* consumer_thread(void *arg) {
    stress_ctx_t *ctx = arg;
    while (atomic_load(&ctx->running)) {
        if (ring_buffer_dequeue(ctx->rb) != NULL) {
            atomic_fetch_add(&ctx->dequeued, 1);
        }
    }
    return NULL;
}

void stress_test_ring_buffer(void) {
    stress_ctx_t ctx = {0};
    ctx.rb = create_ring_buffer(1024);
    atomic_store(&ctx.running, true);

    pthread_t producer, consumer;
    pthread_create(&producer, NULL, producer_thread, &ctx);
    pthread_create(&consumer, NULL, consumer_thread, &ctx);

    // Run for 60 seconds
    sleep(60);
    atomic_store(&ctx.running, false);
    pthread_join(producer, NULL);
    pthread_join(consumer, NULL);

    // Drain anything still in the buffer so the counters can be compared
    while (ring_buffer_dequeue(ctx.rb) != NULL) {
        atomic_fetch_add(&ctx.dequeued, 1);
    }

    // Verify no items were lost or duplicated
    assert(atomic_load(&ctx.enqueued) == atomic_load(&ctx.dequeued));
}
Thread Sanitizer
Clang’s thread sanitizer detects data races:
clang -fsanitize=thread -g -O2 test.c -o test
./test
It caught several subtle bugs in my initial implementations.
Formal Verification
For critical structures, I use model checking:
// Simplified model for the CBMC model checker.
// The undeclared nondet_* functions are CBMC's idiom for arbitrary inputs.
_Bool  nondet_bool(void);
size_t nondet_size_t(void);
void  *nondet_ptr(void);

void model_ring_buffer(void) {
    size_t head = nondet_size_t() & 3;  // Arbitrary but in-range indices
    size_t tail = nondet_size_t() & 3;
    void *data[4];                      // Small size keeps the model tractable

    // Nondeterministic choice between the two operations
    if (nondet_bool()) {
        // Enqueue
        size_t next_tail = (tail + 1) & 3;
        __CPROVER_assume(next_tail != head); // Assume not full
        data[tail] = nondet_ptr();
        tail = next_tail;
    } else {
        // Dequeue
        __CPROVER_assume(head != tail); // Assume not empty
        void *item = data[head];
        __CPROVER_assert(item != NULL, "dequeued item should be valid");
        head = (head + 1) & 3;
    }
}
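I run it with the model function as the entry point; the file name below is just a placeholder, but the invocation looks roughly like:

cbmc ring_buffer_model.c --function model_ring_buffer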
Performance Results
Lock-free structures delivered significant improvements in FC-Redirect:
Ring Buffer (SPSC):
- Lock-based: 45ns per operation
- Lock-free: 8ns per operation
- 5.6x faster
Queue (MPSC):
- Lock-based: 120ns per operation
- Lock-free: 35ns per operation
- 3.4x faster
Hash Table:
- Lock-based: 180ns lookup
- Lock-free: 95ns lookup
- 1.9x faster
System-wide impact:
- Overall throughput: +18%
- CPU utilization: -22%
- Tail latency: -35%
When Locks Are Better
Lock-free isn’t always faster. Locks are better when:
- Complex critical sections: Multiple steps, error handling
- Low contention: Uncontested locks are very fast
- Need to wait: Sometimes blocking is the right behavior
- Simplicity matters: Locks are easier to reason about
Example where I reverted to locks:
// Complex flow table resize operation
void resize_flow_table(flow_table_t *table) {
pthread_mutex_lock(&table->resize_lock);
// Multi-step operation:
// 1. Allocate new table
// 2. Rehash all entries
// 3. Update pointers
// 4. Free old table
// This would be a nightmare to make lock-free
// Locks are fine since resize is rare
pthread_mutex_unlock(&table->resize_lock);
}
Lessons Learned
After months of lock-free programming:
- Start with locks: Only go lock-free when profiling shows contention.
- Use proven algorithms: Don't invent your own. Use well-known algorithms.
- Test thoroughly: Lock-free bugs are subtle and rare. Stress test everything.
- Document memory ordering: Future you will forget why you used acquire/release.
- Keep it simple: The simplest lock-free design is usually best.
- Know when to stop: Some problems aren't worth making lock-free.
Resources
Key resources that helped me:
- “The Art of Multiprocessor Programming” by Herlihy & Shavit
- “C++ Concurrency in Action” by Anthony Williams
- Linux kernel lock-free structures
- 1024cores.net (Dmitry Vyukov’s blog)
- Preshing on Programming blog
Conclusion
Lock-free data structures are powerful tools, but they’re not magic. They solve specific problems (contention, priority inversion, real-time requirements) at the cost of complexity.
The key is knowing when to use them. Profile first, identify contention, then apply lock-free techniques selectively. Don’t make your entire codebase lock-free; make the critical paths lock-free.
In FC-Redirect, lock-free structures improved performance significantly in high-contention areas (packet processing, statistics collection). But most of our code still uses locks, and that’s fine.
Choose the right tool for the job. Sometimes that’s locks, sometimes it’s lock-free, sometimes it’s neither. Engineering is about tradeoffs.
Lock-free programming is a valuable skill, but wisdom is knowing when not to use it.