After two years of intensive performance work on FC-Redirect, I’ve accumulated a collection of code optimization techniques that consistently deliver results. These aren’t theoretical optimizations from textbooks; they’re battle-tested practices that improved production performance. Let me share the most impactful ones.
Principle 1: Profile Before Optimizing
This is the cardinal rule, but it bears repeating: always profile before optimizing.
Bad Example
// Someone's "optimization" without profiling
void process_packet_bad(packet_t *pkt) {
// Replaced modulo with bitwise AND (classic optimization)
int index = hash(pkt) & (TABLE_SIZE - 1); // "Faster than %"
// But profiling shows hash() takes 95% of the time!
// Optimizing the modulo saved 0.1%, wasted hours of dev time
}
Good Example
// Profile first
$ perf record -g ./fc_redirect
$ perf report
// Output shows:
// 87% time in hash_function()
// 8% time in flow_lookup()
// 5% everything else
// Now optimize what matters: the hash function
Lesson: Premature optimization is the root of all evil. Profile first.
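perf tells you where the time goes; for verifying that a single change actually helped, a quick wall-clock harness is often enough. Here is a minimal sketch (the do_work body is just a stand-in for whatever routine you are measuring; compile it with the same optimization flags you ship with):
#include <stdio.h>
#include <time.h>

// Stand-in for the routine being measured; replace with the real hot path
static volatile unsigned sink;
static void do_work(void) {
    unsigned h = 0;
    for (int i = 0; i < 256; i++) {
        h = h * 31 + (unsigned)i;
    }
    sink = h; // volatile store keeps the loop from being optimized away
}

// Time N calls with a monotonic clock and return seconds elapsed
static double time_iterations(void (*fn)(void), int iterations) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        fn();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void) {
    const int n = 1000000;
    double secs = time_iterations(do_work, n);
    printf("%.1f ns per call\n", secs * 1e9 / n);
    return 0;
}
Run it before and after the change, on the same machine, with an iteration count high enough that the result is stable.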
Principle 2: Cache Is King
Modern CPUs are fast, but memory is slow. Cache misses dominate performance.
Data Structure Layout
// Bad: Mixed hot and cold data
typedef struct flow_entry_bad {
wwpn_t key; // Hot (every lookup)
uint64_t packets; // Hot
uint64_t bytes; // Hot
char description[256]; // Cold (rarely accessed)
timestamp_t created; // Cold
char metadata[512]; // Cold
// Total: 800 bytes (wastes cache lines)
} flow_entry_bad_t;
// Good: Separate hot and cold data
typedef struct flow_entry_hot {
wwpn_t key; // 8 bytes
uint64_t packets; // 8 bytes
uint64_t bytes; // 8 bytes
uint32_t cold_index; // 4 bytes (pointer to cold data)
uint32_t padding; // 4 bytes
// Total: 32 bytes (two entries fit in one 64-byte cache line)
} flow_entry_hot_t;
// (Align the containing array to 64 bytes so entries never straddle cache lines.)
typedef struct flow_entry_cold {
char description[256];
timestamp_t created;
char metadata[512];
// Stored separately, accessed only when needed
} flow_entry_cold_t;
Impact: 3x faster lookups (cache miss rate: 78% → 15%)
Array of Structures vs Structure of Arrays
// Bad: Array of Structures (AoS)
typedef struct particle {
float x, y, z; // Position
float vx, vy, vz; // Velocity
float mass;
char padding[4];
} particle_t;
particle_t particles[10000];
// When updating positions we only need x, y, z,
// but every access pulls the full 32-byte particle (velocity, mass, padding) into cache
void update_positions_aos() {
for (int i = 0; i < 10000; i++) {
particles[i].x += particles[i].vx * dt;
particles[i].y += particles[i].vy * dt;
particles[i].z += particles[i].vz * dt;
// Loading vx, vy, vz, mass, padding unnecessarily
}
}
// Good: Structure of Arrays (SoA)
typedef struct particles {
float *x, *y, *z; // Position arrays
float *vx, *vy, *vz; // Velocity arrays
float *mass;
} particles_t;
void update_positions_soa(particles_t *p) {
for (int i = 0; i < 10000; i++) {
p->x[i] += p->vx[i] * dt;
p->y[i] += p->vy[i] * dt;
p->z[i] += p->vz[i] * dt;
// Only loading position and velocity data (better cache utilization)
}
}
Impact: 2.5x faster (for this specific workload)
Principle 3: Branch Prediction Matters
CPU branch predictors are good, but predictable code is better.
Avoid Unpredictable Branches
// Bad: Unpredictable branch in hot loop
void process_packets_bad(packet_t *packets, int count) {
for (int i = 0; i < count; i++) {
if (packets[i].type == TYPE_A) { // 50/50 split, unpredictable
process_type_a(&packets[i]);
} else {
process_type_b(&packets[i]);
}
}
}
// Good: Separate passes (the type check remains, but each pass calls only one
// handler, so the instruction stream and branch targets stay uniform)
void process_packets_good(packet_t *packets, int count) {
// First pass: process all TYPE_A
for (int i = 0; i < count; i++) {
if (packets[i].type == TYPE_A) {
process_type_a(&packets[i]);
}
}
// Second pass: process all TYPE_B
for (int i = 0; i < count; i++) {
if (packets[i].type == TYPE_B) {
process_type_b(&packets[i]);
}
}
}
// Better: Sort packets first, eliminate branches
void process_packets_best(packet_t *packets, int count) {
// Sort packets by type (one-time cost)
partition_packets_by_type(packets, count);
// Process without branches
int type_a_count = count_type_a(packets, count);
for (int i = 0; i < type_a_count; i++) {
process_type_a(&packets[i]); // No branch!
}
for (int i = type_a_count; i < count; i++) {
process_type_b(&packets[i]); // No branch!
}
}
Impact: 40% faster (branch misprediction penalty eliminated)
Branchless Programming
// Bad: Conditional min (compilers often emit a cmov here, but not always)
int min_branchy(int a, int b) {
return (a < b) ? a : b; // May compile to a conditional branch
}
// Good: Branchless min (assumes a - b does not overflow)
int min_branchless(int a, int b) {
return a - ((a - b) & -(a > b));
}
// Bad: Conditional update
if (value > threshold) {
counter++;
}
// Good: Branchless update
counter += (value > threshold); // Boolean converts to 0 or 1
Use branchless techniques in hot loops, but only after profiling confirms branches are the bottleneck.
Principle 4: SIMD When Appropriate
Single Instruction Multiple Data (SIMD) can provide 4-8x speedups for data-parallel operations.
Example: Hash Computation
#include <immintrin.h> // AVX2 intrinsics
// Bad: Scalar hash computation
uint32_t hash_scalar(const uint8_t *data, size_t len) {
uint32_t hash = 0;
for (size_t i = 0; i < len; i++) {
hash = hash * 31 + data[i];
}
return hash;
}
// Good: SIMD hash computation (processes 32 bytes per iteration)
// Note: AVX2 has no 8-bit multiply, so this version works in eight 32-bit lanes
// and produces a different (but equally usable) hash value than the scalar one.
uint32_t hash_simd(const uint8_t *data, size_t len) {
__m256i hash_vec = _mm256_setzero_si256();
const __m256i mult_vec = _mm256_set1_epi32(31);
size_t i;
for (i = 0; i + 32 <= len; i += 32) {
__m256i data_vec = _mm256_loadu_si256((const __m256i*)&data[i]);
hash_vec = _mm256_mullo_epi32(hash_vec, mult_vec);
hash_vec = _mm256_add_epi32(hash_vec, data_vec);
}
// Reduce the eight 32-bit lanes into a single hash
uint32_t lanes[8];
_mm256_storeu_si256((__m256i*)lanes, hash_vec);
uint32_t hash = 0;
for (int j = 0; j < 8; j++) {
hash = hash * 31 + lanes[j];
}
// Handle remaining bytes with the scalar loop
for (; i < len; i++) {
hash = hash * 31 + data[i];
}
return hash;
}
Impact: 8x faster (for large data blocks)
Caveat: SIMD adds complexity. Only use when profiling shows significant benefit.
Principle 5: Lock Contention Is Expensive
Locks serialize parallel execution. Minimize lock scope and avoid shared locks.
Per-Thread Data
// Bad: Shared counter with lock
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
uint64_t shared_counter = 0;
void increment_counter_locked() {
pthread_mutex_lock(&counter_lock);
shared_counter++;
pthread_mutex_unlock(&counter_lock);
}
// Good: Per-thread counters (lock-free on the hot path)
#define MAX_THREADS 64
__thread uint64_t thread_counter = 0;
uint64_t *thread_counters[MAX_THREADS]; // each thread registers &thread_counter here at startup
int num_threads;
void increment_counter_lockfree() {
thread_counter++; // No lock needed!
}
uint64_t get_total_count() {
uint64_t total = 0;
for (int i = 0; i < num_threads; i++) {
total += *thread_counters[i]; // sum the registered per-thread counters
}
return total;
}
Impact: 50x faster in high-contention scenarios
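If other threads need to read the count frequently and per-thread slots are inconvenient, a middle ground is a C11 atomic. This is only a sketch of that alternative, not FC-Redirect code: there is no mutex, but the shared cache line still bounces between cores, so it will not match per-thread counters under heavy contention.
#include <stdatomic.h>
#include <stdint.h>

// Shared atomic counter: no mutex, but the cache line still moves between
// cores under contention, so prefer per-thread counters when possible
static atomic_uint_fast64_t shared_counter_atomic = 0;

void increment_counter_atomic(void) {
    // Relaxed ordering is enough for a statistics counter; nothing synchronizes on this value
    atomic_fetch_add_explicit(&shared_counter_atomic, 1, memory_order_relaxed);
}

uint64_t read_counter_atomic(void) {
    return (uint64_t)atomic_load_explicit(&shared_counter_atomic, memory_order_relaxed);
}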
Read-Write Locks
// Bad: Exclusive lock for read-mostly workload
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
void reader() {
pthread_mutex_lock(&lock); // Serializes all readers!
read_data();
pthread_mutex_unlock(&lock);
}
// Good: Read-write lock
pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
void reader() {
pthread_rwlock_rdlock(&rwlock); // Multiple readers can proceed
read_data();
pthread_rwlock_unlock(&rwlock);
}
void writer() {
pthread_rwlock_wrlock(&rwlock); // Exclusive write access
write_data();
pthread_rwlock_unlock(&rwlock);
}
Impact: 10x throughput for read-heavy workloads
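The other half of this principle, minimizing lock scope, deserves its own example. The idea: hold the lock just long enough to copy the shared state, then do the slow work on the private copy. A minimal sketch with hypothetical flow_stats_t and format_and_send_report names:
#include <pthread.h>
#include <stdio.h>

typedef struct { unsigned long packets, bytes; } flow_stats_t; // hypothetical

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static flow_stats_t table_stats; // shared state protected by table_lock

// Hypothetical slow reporting step (formatting, syslog, network, ...)
static void format_and_send_report(const flow_stats_t *s) {
    printf("packets=%lu bytes=%lu\n", s->packets, s->bytes);
}

// Bad: lock held across the slow reporting work, blocking every other thread
void report_stats_wide_scope(void) {
    pthread_mutex_lock(&table_lock);
    format_and_send_report(&table_stats);
    pthread_mutex_unlock(&table_lock);
}

// Good: copy under the lock, then do the slow work on the private copy
void report_stats_narrow_scope(void) {
    flow_stats_t snapshot;
    pthread_mutex_lock(&table_lock);
    snapshot = table_stats; // cheap struct copy while holding the lock
    pthread_mutex_unlock(&table_lock);
    format_and_send_report(&snapshot); // slow work, no lock held
}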
Principle 6: Memory Allocation Is Expensive
Malloc/free are surprisingly expensive. Minimize allocations in hot paths.
Object Pooling
// Bad: Allocate/free in hot path
void process_packet_allocating(packet_t *pkt) {
buffer_t *buf = malloc(sizeof(buffer_t)); // Expensive!
process_buffer(buf);
free(buf); // Expensive!
}
// Good: Object pool
typedef struct buffer_pool {
buffer_t *free_list;
pthread_mutex_t lock;
} buffer_pool_t;
buffer_t* buffer_pool_alloc(buffer_pool_t *pool) {
pthread_mutex_lock(&pool->lock);
buffer_t *buf = pool->free_list;
if (buf) {
pool->free_list = buf->next;
pthread_mutex_unlock(&pool->lock);
return buf;
}
pthread_mutex_unlock(&pool->lock);
// Pool empty, allocate new buffer
return malloc(sizeof(buffer_t));
}
void buffer_pool_free(buffer_pool_t *pool, buffer_t *buf) {
pthread_mutex_lock(&pool->lock);
buf->next = pool->free_list;
pool->free_list = buf;
pthread_mutex_unlock(&pool->lock);
}
Impact: 20x faster (allocation overhead eliminated)
Stack Allocation
// Bad: Heap allocation for temporary buffer
void process_data_heap() {
char *buffer = malloc(1024);
do_work(buffer);
free(buffer);
}
// Good: Stack allocation
void process_data_stack() {
char buffer[1024]; // Stack-allocated, no malloc/free overhead
do_work(buffer);
}
// Also possible: Variable-length arrays (C99; optional in C11)
void process_data_vla(size_t size) {
char buffer[size]; // Stack-allocated, size determined at runtime; large sizes risk stack overflow
do_work(buffer);
}
Impact: Allocation overhead eliminated entirely
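One caveat the VLA version only hints at: a large runtime size can overflow the stack. A common pattern, sketched here with an arbitrary SMALL_BUF_SIZE threshold (not FC-Redirect code), is a fixed stack buffer with a heap fallback for oversized requests:
#include <stdlib.h>

#define SMALL_BUF_SIZE 1024 // arbitrary threshold for this sketch

void do_work(char *buffer); // as in the examples above

void process_data_hybrid(size_t size) {
    char small_buf[SMALL_BUF_SIZE];
    char *buffer = small_buf;
    if (size > SMALL_BUF_SIZE) {
        buffer = malloc(size); // rare oversized case pays for the heap
        if (!buffer) {
            return; // handle allocation failure as appropriate
        }
    }
    do_work(buffer);
    if (buffer != small_buf) {
        free(buffer);
    }
}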
Principle 7: Compiler Optimizations
Modern compilers are smart. Help them help you.
Restrict Keyword
// Bad: Compiler can't assume pointers don't alias
void vector_add(int *a, int *b, int *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
// Compiler must reload a[i], b[i] each iteration (might alias c)
}
}
// Good: Tell compiler pointers don't alias
void vector_add_restrict(int * restrict a,
int * restrict b,
int * restrict c,
int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
// Compiler can optimize (no aliasing)
}
}
Impact: 2x faster (enables SIMD vectorization)
Likely/Unlikely Hints
// Bad: No branch hints
if (unlikely_error_condition()) {
handle_error();
}
process_normal_case();
// Good: Branch hints
if (__builtin_expect(unlikely_error_condition(), 0)) { // Unlikely
handle_error();
}
process_normal_case();
// Macro for readability
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
if (unlikely(error_condition)) {
handle_error();
}
Impact: 5-10% (better instruction cache utilization)
Principle 8: I/O Is the Bottleneck
Disk and network I/O dwarf CPU costs. Optimize I/O first.
Batching
// Bad: Write to disk for every event
void log_event_unbatched(event_t *event) {
write(log_fd, event, sizeof(event_t));
fsync(log_fd); // Expensive!
}
// Good: Batch writes
#define BATCH_SIZE 1000
event_t event_batch[BATCH_SIZE];
int batch_count = 0;
void log_event_batched(event_t *event) {
event_batch[batch_count++] = *event;
if (batch_count >= BATCH_SIZE) {
write(log_fd, event_batch, batch_count * sizeof(event_t)); // one syscall for the whole batch
fsync(log_fd);
batch_count = 0;
}
}
// Flush any partial batch on shutdown (or on a timer) so trailing events aren't lost.
Impact: 1000x fewer fsync calls, 100x faster
Asynchronous I/O
// Bad: Synchronous I/O blocks
void read_file_sync(const char *path) {
int fd = open(path, O_RDONLY);
char buffer[4096];
ssize_t bytes = read(fd, buffer, sizeof(buffer)); // Blocks!
process_data(buffer, bytes);
close(fd);
}
// Good: Asynchronous I/O (io_uring or aio)
void read_file_async(const char *path) {
struct io_uring ring;
io_uring_queue_init(8, &ring, 0);
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
int fd = open(path, O_RDONLY);
char *buffer = malloc(4096);
io_uring_prep_read(sqe, fd, buffer, 4096, 0);
io_uring_submit(&ring);
// Do other work while I/O is in progress
do_other_work();
// Wait for completion
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
process_data(buffer, cqe->res); // cqe->res is the byte count (or a negative errno)
io_uring_cqe_seen(&ring, cqe);
io_uring_queue_exit(&ring); // Release ring resources
close(fd);
free(buffer);
}
Impact: Overlaps I/O with computation, higher throughput
Real-World Results
These optimizations delivered measurable improvements to FC-Redirect:
| Optimization | Impact |
|---|---|
| Hot/cold data splitting | 3x faster lookups |
| Per-thread counters | 50x less contention |
| Object pooling | 20x faster allocation |
| Batched I/O | 100x faster writes |
| SIMD hash function | 8x faster hashing |
| Branch elimination | 40% speedup |
Overall system impact:
- Throughput: 2.9M → 4.2M packets/sec (+45%)
- Latency: 2.0μs → 1.4μs P99 (-30%)
- CPU utilization: 65% → 45% (-31%)
The Optimization Checklist
When optimizing code:
- ☐ Profile to find actual bottleneck
- ☐ Optimize algorithm first (O(n²) → O(n log n); see the sketch after this list)
- ☐ Optimize data structures (cache-friendly layout)
- ☐ Reduce memory allocations (pooling, stack allocation)
- ☐ Minimize lock contention (per-thread data, RCU)
- ☐ Batch I/O operations
- ☐ Help the compiler (restrict, likely/unlikely)
- ☐ Consider SIMD (only for data-parallel operations)
- ☐ Measure improvement (verify optimization worked)
- ☐ Document why (future maintainers need context)
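The algorithm item is the easiest to skip past, so here is a minimal, generic sketch of the kind of win it refers to: detecting duplicate IDs drops from O(n²) pairwise comparison to O(n log n) by sorting a copy first. The function names are illustrative, not FC-Redirect code.
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Bad: O(n^2) pairwise comparison
bool has_duplicate_quadratic(const uint64_t *ids, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (ids[i] == ids[j]) {
                return true;
            }
        }
    }
    return false;
}

// Good: O(n log n): sort a scratch copy, then compare neighbors
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

bool has_duplicate_sorted(const uint64_t *ids, size_t n) {
    uint64_t *copy = malloc(n * sizeof(*copy));
    if (!copy) {
        return has_duplicate_quadratic(ids, n); // fall back if allocation fails
    }
    memcpy(copy, ids, n * sizeof(*copy));
    qsort(copy, n, sizeof(*copy), cmp_u64);
    bool dup = false;
    for (size_t i = 1; i < n; i++) {
        if (copy[i] == copy[i - 1]) {
            dup = true;
            break;
        }
    }
    free(copy);
    return dup;
}
A hash set brings this to O(n) average at the cost of an extra data structure; the sort-based version needs only a scratch copy.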
Conclusion
Optimization is a systematic process:
- Measure (profile)
- Understand (analyze bottleneck)
- Optimize (apply technique)
- Verify (measure again)
The techniques in this post are proven to work in production. But remember: correctness first, then optimize. Fast and wrong is useless.
After two years of optimization work, the biggest lesson is this: measure, don’t guess. Every hour spent profiling saves days of optimizing the wrong thing.
Optimization is a craft. These techniques are tools. Use them wisely.