After two years of intensive performance work on FC-Redirect, I’ve accumulated a collection of code optimization techniques that consistently deliver results. These aren’t theoretical optimizations from textbooks; they’re battle-tested practices that improved production performance. Let me share the most impactful ones.

Principle 1: Profile Before Optimizing

This is the cardinal rule, but it bears repeating: always profile before optimizing.

Bad Example

// Someone's "optimization" without profiling
void process_packet_bad(packet_t *pkt) {
    // Replaced modulo with bitwise AND (classic optimization)
    int index = hash(pkt) & (TABLE_SIZE - 1);  // "Faster than %"

    // But profiling shows hash() takes 95% of the time!
    // Optimizing the modulo saved 0.1%, wasted hours of dev time
}

Good Example

// Profile first
$ perf record -g ./fc_redirect
$ perf report

// Output shows:
// 87% time in hash_function()
// 8% time in flow_lookup()
// 5% everything else

// Now optimize what matters: the hash function

Lesson: Premature optimization is the root of all evil. Profile first.
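
And once you have optimized, measure again to confirm the win. A minimal sketch of the timing harness I use for that second measurement; the hash_function() signature here is an assumption, substitute whatever the profile actually flagged:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// Assumed signature of the hot function identified by perf
uint32_t hash_function(const uint8_t *data, size_t len);

// Monotonic wall-clock time in nanoseconds
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

void benchmark_hash(const uint8_t *data, size_t len, int iterations) {
    volatile uint32_t sink = 0;            // Keeps the calls from being optimized away
    uint64_t start = now_ns();
    for (int i = 0; i < iterations; i++) {
        sink += hash_function(data, len);
    }
    uint64_t elapsed = now_ns() - start;
    printf("%.1f ns/call\n", (double)elapsed / iterations);
    (void)sink;
}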

Principle 2: Cache Is King

Modern CPUs are fast, but memory is slow. Cache misses dominate performance.

Data Structure Layout

// Bad: Mixed hot and cold data
typedef struct flow_entry_bad {
    wwpn_t key;              // Hot (every lookup)
    uint64_t packets;        // Hot
    uint64_t bytes;          // Hot
    char description[256];   // Cold (rarely accessed)
    timestamp_t created;     // Cold
    char metadata[512];      // Cold
    // Total: 800 bytes (wastes cache lines)
} flow_entry_bad_t;

// Good: Separate hot and cold data
typedef struct flow_entry_hot {
    wwpn_t key;              // 8 bytes
    uint64_t packets;        // 8 bytes
    uint64_t bytes;          // 8 bytes
    uint32_t cold_index;     // 4 bytes (index into the cold-data array)
    uint32_t padding;        // 4 bytes
    // 32 bytes of hot fields; aligned(64) keeps each entry on its own
    // cache line, so a lookup never straddles two lines
} flow_entry_hot_t __attribute__((aligned(64)));

typedef struct flow_entry_cold {
    char description[256];
    timestamp_t created;
    char metadata[512];
    // Stored separately, accessed only when needed
} flow_entry_cold_t;

Impact: 3x faster lookups (cache miss rate: 78% → 15%)
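
For the split to pay off, the lookup path must touch only the hot array and reach through cold_index only when a caller actually needs the cold fields. A deliberately simplified sketch, assuming table arrays named flow_hot and flow_cold, a FLOW_TABLE_SIZE constant, a linear scan, and a wwpn_t that compares with ==:

flow_entry_hot_t  flow_hot[FLOW_TABLE_SIZE];   // Walked on every lookup
flow_entry_cold_t flow_cold[FLOW_TABLE_SIZE];  // Touched only on demand

// Hot path: reads only hot entries; cold data never enters the cache here
flow_entry_hot_t *flow_lookup(wwpn_t key) {
    for (size_t i = 0; i < FLOW_TABLE_SIZE; i++) {
        if (flow_hot[i].key == key) {
            return &flow_hot[i];
        }
    }
    return NULL;
}

// Cold path: only here do description/metadata get pulled into cache
const char *flow_description(const flow_entry_hot_t *hot) {
    return flow_cold[hot->cold_index].description;
}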

Array of Structures vs Structure of Arrays

// Bad: Array of Structures (AoS)
typedef struct particle {
    float x, y, z;       // Position
    float vx, vy, vz;    // Velocity
    float mass;
    char padding[4];
} particle_t;

particle_t particles[10000];

// When updating positions, we only need x, y, z
// But every access pulls the whole 32-byte particle into cache
void update_positions_aos() {
    for (int i = 0; i < 10000; i++) {
        particles[i].x += particles[i].vx * dt;
        particles[i].y += particles[i].vy * dt;
        particles[i].z += particles[i].vz * dt;
        // Loading vx, vy, vz, mass, padding unnecessarily
    }
}

// Good: Structure of Arrays (SoA)
typedef struct particles {
    float *x, *y, *z;         // Position arrays
    float *vx, *vy, *vz;      // Velocity arrays
    float *mass;
} particles_t;

void update_positions_soa(particles_t *p) {
    for (int i = 0; i < 10000; i++) {
        p->x[i] += p->vx[i] * dt;
        p->y[i] += p->vy[i] * dt;
        p->z[i] += p->vz[i] * dt;
        // Only loading position and velocity data (better cache utilization)
    }
}

Impact: 2.5x faster (for this specific workload)

Principle 3: Branch Prediction Matters

CPU branch predictors are good, but predictable code is better.

Avoid Unpredictable Branches

// Bad: Unpredictable branch in hot loop
void process_packets_bad(packet_t *packets, int count) {
    for (int i = 0; i < count; i++) {
        if (packets[i].type == TYPE_A) {  // 50/50 split, unpredictable
            process_type_a(&packets[i]);
        } else {
            process_type_b(&packets[i]);
        }
    }
}

// Good: Separate loops (predictable branches)
void process_packets_good(packet_t *packets, int count) {
    // First pass: process all TYPE_A
    for (int i = 0; i < count; i++) {
        if (packets[i].type == TYPE_A) {
            process_type_a(&packets[i]);
        }
    }

    // Second pass: process all TYPE_B
    for (int i = 0; i < count; i++) {
        if (packets[i].type == TYPE_B) {
            process_type_b(&packets[i]);
        }
    }
}

// Better: Partition packets by type first, eliminate branches entirely
void process_packets_best(packet_t *packets, int count) {
    // Partition packets by type (one-time cost)
    partition_packets_by_type(packets, count);

    // Process without branches
    int type_a_count = count_type_a(packets, count);

    for (int i = 0; i < type_a_count; i++) {
        process_type_a(&packets[i]);  // No branch!
    }

    for (int i = type_a_count; i < count; i++) {
        process_type_b(&packets[i]);  // No branch!
    }
}

Impact: 40% faster (branch misprediction penalty eliminated)
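
The partition_packets_by_type() helper referenced above can be a simple in-place, two-pointer pass. A minimal sketch; its return value can double as type_a_count, saving the separate count_type_a() pass:

// One-time, in-place partition: all TYPE_A packets first, the rest after.
// Returns the number of TYPE_A packets.
int partition_packets_by_type(packet_t *packets, int count) {
    int next_a = 0;
    for (int i = 0; i < count; i++) {
        if (packets[i].type == TYPE_A) {
            packet_t tmp = packets[next_a];   // Swap current packet into the TYPE_A region
            packets[next_a] = packets[i];
            packets[i] = tmp;
            next_a++;
        }
    }
    return next_a;
}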

Branchless Programming

// Bad: Branchy min/max
int min_branchy(int a, int b) {
    return (a < b) ? a : b;  // May compile to a conditional branch
}

// Good: Branchless min
int min_branchless(int a, int b) {
    // -(a > b) is all-ones when a > b, zero otherwise (assumes a - b doesn't overflow)
    return a - ((a - b) & -(a > b));
}

// Bad: Conditional update
if (value > threshold) {
    counter++;
}

// Good: Branchless update
counter += (value > threshold);  // Boolean converts to 0 or 1

Use branchless techniques in hot loops, but only after profiling confirms branches are the bottleneck.

Principle 4: SIMD When Appropriate

Single Instruction Multiple Data (SIMD) can provide 4-8x speedups for data-parallel operations.

Example: Hash Computation

#include <immintrin.h>  // AVX2 intrinsics
#include <stddef.h>     // size_t
#include <stdint.h>     // uint8_t, uint32_t

// Bad: Scalar hash computation
uint32_t hash_scalar(const uint8_t *data, size_t len) {
    uint32_t hash = 0;
    for (size_t i = 0; i < len; i++) {
        hash = hash * 31 + data[i];
    }
    return hash;
}

// Good: SIMD hash computation (processes 32 bytes per iteration)
// Note: each 32-bit lane runs its own hash, so the result differs from
// hash_scalar(); don't mix the two on the same table.
uint32_t hash_simd(const uint8_t *data, size_t len) {
    __m256i hash_vec = _mm256_setzero_si256();
    __m256i mult_vec = _mm256_set1_epi32(31);

    size_t i;
    for (i = 0; i + 32 <= len; i += 32) {
        __m256i data_vec = _mm256_loadu_si256((const __m256i *)&data[i]);
        // Per-lane: hash = hash * 31 + next 32-bit chunk
        hash_vec = _mm256_mullo_epi32(hash_vec, mult_vec);
        hash_vec = _mm256_add_epi32(hash_vec, data_vec);
    }

    // Horizontal sum of the eight 32-bit lane hashes
    __m128i lo  = _mm256_castsi256_si128(hash_vec);
    __m128i hi  = _mm256_extracti128_si256(hash_vec, 1);
    __m128i sum = _mm_add_epi32(lo, hi);
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    uint32_t hash = (uint32_t)_mm_cvtsi128_si32(sum);

    // Handle remaining bytes with the scalar recurrence
    for (; i < len; i++) {
        hash = hash * 31 + data[i];
    }

    return hash;
}

Impact: 8x faster (for large data blocks)

Caveat: SIMD adds complexity. Only use when profiling shows significant benefit.
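
One way to contain that complexity is to compile both versions and choose at runtime. A sketch using GCC/Clang's __builtin_cpu_supports(); the function-pointer dispatch here is illustrative, not FC-Redirect's actual mechanism:

#include <stddef.h>
#include <stdint.h>

uint32_t hash_scalar(const uint8_t *data, size_t len);
uint32_t hash_simd(const uint8_t *data, size_t len);    // Built with AVX2 enabled

// Resolved once at startup
static uint32_t (*hash_impl)(const uint8_t *, size_t) = hash_scalar;

void hash_init(void) {
    __builtin_cpu_init();                    // Initialize the CPU feature checks
    if (__builtin_cpu_supports("avx2")) {
        hash_impl = hash_simd;               // Use the SIMD path only where AVX2 exists
    }
}

static inline uint32_t hash_dispatch(const uint8_t *data, size_t len) {
    return hash_impl(data, len);
}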

Principle 5: Lock Contention Is Expensive

Locks serialize parallel execution. Minimize lock scope and avoid shared locks.

Per-Thread Data

// Bad: Shared counter with lock
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
uint64_t shared_counter = 0;

void increment_counter_locked() {
    pthread_mutex_lock(&counter_lock);
    shared_counter++;
    pthread_mutex_unlock(&counter_lock);
}

// Good: Per-thread counters (lock-free)
#define MAX_THREADS 64

typedef struct {
    uint64_t count;
    char pad[64 - sizeof(uint64_t)];  // Pad to a cache line to avoid false sharing
} padded_counter_t;

padded_counter_t thread_counters[MAX_THREADS];
__thread int my_thread_id;            // Assigned once when the thread starts

void increment_counter_lockfree() {
    thread_counters[my_thread_id].count++;  // No lock needed!
}

uint64_t get_total_count() {
    uint64_t total = 0;
    for (int i = 0; i < num_threads; i++) {
        total += thread_counters[i].count;   // A slightly stale snapshot is fine for stats
    }
    return total;
}

Impact: 50x faster in high-contention scenarios
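
When a counter must be visible to other threads immediately, or per-thread storage is awkward, a C11 atomic is a middle ground: no mutex, but still one shared cache line, so under heavy contention it sits between the locked and per-thread versions. A minimal sketch:

#include <stdatomic.h>
#include <stdint.h>

atomic_uint_fast64_t atomic_counter;

void increment_counter_atomic(void) {
    // Relaxed ordering is enough for a statistics counter
    atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
}

uint64_t get_count_atomic(void) {
    return (uint64_t)atomic_load_explicit(&atomic_counter, memory_order_relaxed);
}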

Read-Write Locks

// Bad: Exclusive lock for read-mostly workload
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void reader_locked() {
    pthread_mutex_lock(&lock);  // Serializes all readers!
    read_data();
    pthread_mutex_unlock(&lock);
}

// Good: Read-write lock
pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

void reader_rwlock() {
    pthread_rwlock_rdlock(&rwlock);  // Multiple readers can proceed
    read_data();
    pthread_rwlock_unlock(&rwlock);
}

void writer() {
    pthread_rwlock_wrlock(&rwlock);  // Exclusive write access
    write_data();
    pthread_rwlock_unlock(&rwlock);
}

Impact: 10x throughput for read-heavy workloads

Principle 6: Memory Allocation Is Expensive

malloc() and free() are surprisingly expensive, and allocator locks can add contention under multithreaded load. Minimize allocations in hot paths.

Object Pooling

// Bad: Allocate/free in hot path
void process_packet_allocating(packet_t *pkt) {
    buffer_t *buf = malloc(sizeof(buffer_t));  // Expensive!
    process_buffer(buf);
    free(buf);  // Expensive!
}

// Good: Object pool
typedef struct buffer_pool {
    buffer_t *free_list;
    pthread_mutex_t lock;
} buffer_pool_t;

buffer_t* buffer_pool_alloc(buffer_pool_t *pool) {
    pthread_mutex_lock(&pool->lock);

    buffer_t *buf = pool->free_list;
    if (buf) {
        pool->free_list = buf->next;
        pthread_mutex_unlock(&pool->lock);
        return buf;
    }

    pthread_mutex_unlock(&pool->lock);

    // Pool empty, allocate new buffer
    return malloc(sizeof(buffer_t));
}

void buffer_pool_free(buffer_pool_t *pool, buffer_t *buf) {
    pthread_mutex_lock(&pool->lock);
    buf->next = pool->free_list;
    pool->free_list = buf;
    pthread_mutex_unlock(&pool->lock);
}

Impact: 20x faster (allocation overhead eliminated)
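
The pool pays off most when it is pre-populated at startup so steady-state traffic never falls through to malloc(). A minimal init sketch; the initial_count parameter and the next field inside buffer_t are assumptions about the surrounding code:

// Pre-fill the free list so the hot path rarely (ideally never) calls malloc()
int buffer_pool_init(buffer_pool_t *pool, size_t initial_count) {
    pool->free_list = NULL;
    pthread_mutex_init(&pool->lock, NULL);

    for (size_t i = 0; i < initial_count; i++) {
        buffer_t *buf = malloc(sizeof(buffer_t));
        if (!buf) {
            return -1;                    // Caller decides if a partial pool is acceptable
        }
        buf->next = pool->free_list;      // Push onto the free list
        pool->free_list = buf;
    }
    return 0;
}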

Stack Allocation

// Bad: Heap allocation for temporary buffer
void process_data_heap() {
    char *buffer = malloc(1024);
    do_work(buffer);
    free(buffer);
}

// Good: Stack allocation
void process_data_stack() {
    char buffer[1024];  // Stack-allocated, no malloc/free overhead
    do_work(buffer);
}

// Sometimes useful: variable-length arrays (C99, optional in C11 and later)
void process_data_vla(size_t size) {
    char buffer[size];  // Stack-allocated, size determined at runtime
    do_work(buffer);    // Only safe when size is small and bounded (stack overflow risk)
}

Impact: Allocation overhead eliminated entirely

Principle 7: Compiler Optimizations

Modern compilers are smart. Help them help you.

Restrict Keyword

// Bad: Compiler can't assume pointers don't alias
void vector_add(int *a, int *b, int *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        // Compiler must reload a[i], b[i] each iteration (might alias c)
    }
}

// Good: Tell compiler pointers don't alias
void vector_add_restrict(int * restrict a,
                        int * restrict b,
                        int * restrict c,
                        int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        // Compiler can optimize (no aliasing)
    }
}

Impact: 2x faster (enables SIMD vectorization)

Likely/Unlikely Hints

// Bad: No branch hints
if (unlikely_error_condition()) {
    handle_error();
}
process_normal_case();

// Good: Branch hints
if (__builtin_expect(unlikely_error_condition(), 0)) {  // Unlikely
    handle_error();
}
process_normal_case();

// Macro for readability
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

if (unlikely(error_condition)) {
    handle_error();
}

Impact: 5-10% (better instruction cache utilization)

Principle 8: I/O Is the Bottleneck

Disk and network I/O dwarf CPU costs. Optimize I/O first.

Batching

// Bad: Write to disk for every event
void log_event_unbatched(event_t *event) {
    write(log_fd, event, sizeof(event_t));
    fsync(log_fd);  // Expensive!
}

// Good: Batch writes
#define BATCH_SIZE 1000
event_t event_batch[BATCH_SIZE];
int batch_count = 0;

void log_event_batched(event_t *event) {
    event_batch[batch_count++] = *event;

    if (batch_count >= BATCH_SIZE) {
        write(log_fd, event_batch, batch_count * sizeof(event_t));
        fsync(log_fd);
        batch_count = 0;
    }
}

Impact: 1000x fewer fsync calls, 100x faster
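
One caveat: up to BATCH_SIZE - 1 events can sit in memory indefinitely, so the batch needs an explicit flush, typically from a periodic timer and at shutdown. A minimal flush matching the batching code above:

// Flush whatever is buffered, regardless of batch size.
// Call from a periodic timer and on shutdown so events are never stranded.
void log_flush(void) {
    if (batch_count > 0) {
        write(log_fd, event_batch, batch_count * sizeof(event_t));
        fsync(log_fd);
        batch_count = 0;
    }
}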

Asynchronous I/O

// Bad: Synchronous I/O blocks
void read_file_sync(const char *path) {
    int fd = open(path, O_RDONLY);
    char buffer[4096];
    ssize_t bytes = read(fd, buffer, sizeof(buffer));  // Blocks!
    process_data(buffer, bytes);
    close(fd);
}

// Good: Asynchronous I/O (io_uring via liburing, or POSIX aio)
#include <liburing.h>
void read_file_async(const char *path) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    int fd = open(path, O_RDONLY);

    char *buffer = malloc(4096);

    io_uring_prep_read(sqe, fd, buffer, 4096, 0);
    io_uring_submit(&ring);

    // Do other work while I/O is in progress
    do_other_work();

    // Wait for completion
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);

    process_data(buffer, cqe->res);   // cqe->res is the byte count (negative errno on error)

    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);       // Tear down the ring
    close(fd);
    free(buffer);
}

Impact: Overlaps I/O with computation, higher throughput

Real-World Results

These optimizations delivered measurable improvements to FC-Redirect:

Optimization               Impact
Hot/cold data splitting    3x faster lookups
Per-thread counters        50x less contention
Object pooling             20x faster allocation
Batched I/O                100x faster writes
SIMD hash function         8x faster hashing
Branch elimination         40% speedup

Overall system impact:

  • Throughput: 2.9M → 4.2M packets/sec (+45%)
  • Latency: 2.0μs → 1.4μs P99 (-30%)
  • CPU utilization: 65% → 45% (-31%)

The Optimization Checklist

When optimizing code:

  1. ☐ Profile to find actual bottleneck
  2. ☐ Optimize algorithm first (O(n²) → O(n log n))
  3. ☐ Optimize data structures (cache-friendly layout)
  4. ☐ Reduce memory allocations (pooling, stack allocation)
  5. ☐ Minimize lock contention (per-thread data, RCU)
  6. ☐ Batch I/O operations
  7. ☐ Help the compiler (restrict, likely/unlikely)
  8. ☐ Consider SIMD (only for data-parallel operations)
  9. ☐ Measure improvement (verify optimization worked)
  10. ☐ Document why (future maintainers need context)

Conclusion

Optimization is a systematic process:

  1. Measure (profile)
  2. Understand (analyze bottleneck)
  3. Optimize (apply technique)
  4. Verify (measure again)

The techniques in this post are proven to work in production. But remember: correctness first, then optimize. Fast and wrong is useless.

After two years of optimization work, the biggest lesson is this: measure, don’t guess. Every hour spent profiling saves days of optimizing the wrong thing.

Optimization is a craft. These techniques are tools. Use them wisely.