Over the past year, I’ve spent countless hours optimizing FC-Redirect. Through trial and error, I’ve developed a systematic workflow that consistently produces results. Today I want to share this approach, along with the tools and techniques that make it effective.
The Three Commandments of Optimization
Before diving into specifics, three principles guide all my optimization work:
- Measure, don’t guess: Intuition is often wrong. Always profile before optimizing.
- Optimize the critical path: Focus on what matters; typically 90% of the time is spent in 10% of the code.
- Validate improvements: Measure after optimization to confirm the improvement is real.
These seem obvious, but I’ve wasted weeks ignoring them.
Step 1: Establish a Baseline
Before optimizing anything, I establish a comprehensive baseline:
Performance Metrics
typedef struct performance_baseline {
// Throughput
uint64_t packets_per_second;
uint64_t bytes_per_second;
// Latency
latency_histogram_t latency_dist;
uint64_t p50_latency_ns;
uint64_t p99_latency_ns;
uint64_t p999_latency_ns;
// Resource utilization
float cpu_utilization;
float memory_utilization;
uint64_t cache_misses;
uint64_t tlb_misses;
// System characteristics
uint64_t context_switches_per_sec;
uint64_t syscalls_per_sec;
} performance_baseline_t;
performance_baseline_t capture_baseline(test_workload_t *workload) {
performance_baseline_t baseline = {0};
// Run workload
start_perf_monitoring();
run_workload(workload, 60*SECONDS);
stop_perf_monitoring();
// Collect metrics
baseline.packets_per_second = get_packet_rate();
baseline.latency_dist = get_latency_histogram();
baseline.p50_latency_ns = histogram_percentile(&baseline.latency_dist, 50);
baseline.p99_latency_ns = histogram_percentile(&baseline.latency_dist, 99);
baseline.p999_latency_ns = histogram_percentile(&baseline.latency_dist, 99.9);
// CPU metrics
baseline.cpu_utilization = get_cpu_utilization();
baseline.cache_misses = get_perf_counter(PERF_COUNT_CACHE_MISSES);
baseline.tlb_misses = get_perf_counter(PERF_COUNT_TLB_MISSES);
return baseline;
}
This baseline becomes my reference point. Any optimization must improve these metrics.
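The latency histogram type and histogram_percentile() above are project helpers I haven't shown. As a minimal sketch of the idea, assuming fixed microsecond-wide buckets (the real FC-Redirect histogram may be laid out differently):

// Sketch only: bucket count and width are assumptions, not the real type.
typedef struct latency_histogram {
    uint64_t buckets[1024];   // bucket i counts samples in [i us, i+1 us)
    uint64_t total_samples;
} latency_histogram_t;

static uint64_t histogram_percentile(const latency_histogram_t *h, double pct) {
    uint64_t target = (uint64_t)(h->total_samples * (pct / 100.0));
    uint64_t seen = 0;
    for (int i = 0; i < 1024; i++) {
        seen += h->buckets[i];
        if (seen >= target)
            return (uint64_t)(i + 1) * 1000;   // upper bucket bound, in ns
    }
    return 1024ULL * 1000;                     // everything past the last bucket
}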
Workload Characterization
I create representative workloads that stress different aspects:
// Workload 1: Many small flows (latency-sensitive)
test_workload_t create_latency_workload() {
return (test_workload_t) {
.num_flows = 10000,
.flow_size_bytes = 1024, // 1KB flows
.arrival_rate = 50000, // 50K flows/sec
.duration_seconds = 60
};
}
// Workload 2: Few large flows (throughput-focused)
test_workload_t create_throughput_workload() {
return (test_workload_t) {
.num_flows = 100,
.flow_size_bytes = 100*1024*1024, // 100MB flows
.arrival_rate = 10,
.duration_seconds = 60
};
}
// Workload 3: Mixed (realistic)
test_workload_t create_mixed_workload() {
return (test_workload_t) {
.num_flows = 5000,
.flow_size_distribution = DIST_PARETO, // 80/20 rule
.arrival_pattern = PATTERN_BURSTY,
.duration_seconds = 60
};
}
Different workloads reveal different bottlenecks. The latency workload exposes lookup overhead, the throughput workload reveals bandwidth limits, and the mixed workload shows real-world behavior.
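For the mixed workload, DIST_PARETO is what produces the 80/20 behavior. The generator itself isn't shown above; a hedged sketch of drawing Pareto-distributed flow sizes by inverse-CDF sampling (the constants are illustrative, not the values FC-Redirect actually uses):

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

// Inverse-CDF sample from a Pareto distribution with scale min_bytes and
// shape alpha. A shape near 1.16 gives roughly 80/20 behavior.
static uint64_t pareto_flow_size(uint64_t min_bytes, double alpha) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   // u in (0, 1)
    return (uint64_t)((double)min_bytes / pow(u, 1.0 / alpha));
}

// Example: most flows land near 1KB, with a heavy tail reaching into the MBs.
// uint64_t size = pareto_flow_size(1024, 1.16);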
Step 2: Profiling
With a baseline established, I profile to find bottlenecks. I use multiple tools:
CPU Profiling with perf
For CPU-bound workloads, Linux perf is indispensable:
# Record profile
perf record -g -F 999 ./fc_redirect_test
# Analyze profile
perf report --stdio --no-children | head -50
This shows where CPU time is spent. A typical output:
47.32% [.] flow_table_lookup
12.18% [.] wwpn_hash
8.91% [.] update_flow_statistics
6.47% [.] process_packet
...
This immediately tells me that flow lookup is the bottleneck consuming 47% of CPU time.
Flamegraphs
For visual analysis, I generate flamegraphs:
perf record -g -F 999 ./fc_redirect_test
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > profile.svg
Flamegraphs make it obvious where time is spent and how functions call each other. Wide bars = hot paths.
Hardware Performance Counters
For deeper analysis, I use hardware counters:
# Cache miss analysis
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
./fc_redirect_test
# Branch prediction analysis
perf stat -e branches,branch-misses \
./fc_redirect_test
# Memory bandwidth
perf stat -e cpu/event=0xb7,umask=0x1/ \
./fc_redirect_test
This reveals microarchitectural issues like cache misses or branch mispredictions.
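The get_perf_counter() calls in the baseline code do this kind of measurement programmatically. One way to read a hardware counter from inside the process is the Linux perf_event_open syscall; a minimal sketch with error handling trimmed and the counter choice hard-coded:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

// Count cache misses in user space while the measured code runs.
static uint64_t measure_cache_misses(void (*measured_fn)(void)) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   // this process, any CPU
    if (fd < 0) return 0;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    measured_fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    close(fd);
    return count;
}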
Custom Instrumentation
For application-specific insights, I add custom timing:
#define TIMING_ENABLED 1
typedef struct timing_info {
const char *name;
uint64_t total_time;
uint64_t call_count;
} timing_info_t;
static timing_info_t timings[100];
static int num_timings = 0;
#define TIME_SECTION_START(name) \
uint64_t _time_start_##name = rdtsc(); \
int _time_index_##name = get_or_create_timing(#name);
#define TIME_SECTION_END(name) \
uint64_t _time_end_##name = rdtsc(); \
timings[_time_index_##name].total_time += \
(_time_end_##name - _time_start_##name); \
timings[_time_index_##name].call_count++;
// Usage
void process_flow(flow_entry_t *flow) {
TIME_SECTION_START(lookup);
flow_entry_t *entry = flow_table_lookup(flow->key);
TIME_SECTION_END(lookup);
TIME_SECTION_START(update);
update_flow_statistics(entry);
TIME_SECTION_END(update);
TIME_SECTION_START(forward);
forward_packet(entry);
TIME_SECTION_END(forward);
}
// Report timing
void print_timing_report() {
for (int i = 0; i < num_timings; i++) {
timing_info_t *t = &timings[i];
double avg_cycles = (double)t->total_time / t->call_count;
printf("%20s: %10lu calls, %8.2f cycles/call\n",
t->name, t->call_count, avg_cycles);
}
}
This gives me precise timing for specific code sections.
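The macros assume two helpers I haven't shown: rdtsc() and get_or_create_timing(). A minimal sketch of both, assuming x86-64 and GCC/Clang (the registry lookup is not thread-safe and does no bounds checking; illustration only):

#include <string.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// Find an existing timing slot by name, or register a new one.
static int get_or_create_timing(const char *name) {
    for (int i = 0; i < num_timings; i++) {
        if (strcmp(timings[i].name, name) == 0)
            return i;
    }
    timings[num_timings].name = name;
    return num_timings++;
}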
Step 3: Understanding the Bottleneck
Profiling shows what is slow. The next step is understanding why.
Analyzing Hot Functions
When perf shows a hot function, I examine it carefully:
// Hot function identified by profiling
flow_entry_t* flow_table_lookup(flow_table_t *table, wwpn_t key) {
// Hypothesis 1: Hash function is expensive
uint32_t hash = wwpn_hash(key); // Measured: 30 cycles
// Hypothesis 2: Hash table collisions
uint32_t index = hash & table->mask;
// Hypothesis 3: Cache misses on lookup
flow_entry_t *entry = &table->entries[index]; // Cache miss here!
// Hypothesis 4: Linear probing is expensive
while (entry->in_use && entry->key != key) {
index = (index + 1) & table->mask;
entry = &table->entries[index]; // Another cache miss
}
return entry;
}
By annotating the function with hypotheses and measurements, I identify that cache misses during linear probing are the real bottleneck.
Cache Miss Analysis
Cache misses are often the culprit. I use perf to confirm:
perf record -e cache-misses -g ./fc_redirect_test
perf report --stdio
This shows which functions cause cache misses. For our lookup function:
68.4% flow_table_lookup
|--52.1%-- linear probing
|--31.2%-- entry access
'--16.7%-- hash computation
Most cache misses happen during linear probing, confirming the hypothesis.
Building a Mental Model
I visualize memory access patterns:
Hash table layout in memory:
[Entry 0][Entry 1][Entry 2]...[Entry N]
Linear probing for key with hash=5:
1. Access entry[5] <- Cache miss (random access)
2. Key doesn't match
3. Access entry[6] <- Might hit (sequential)
4. Key doesn't match
5. Access entry[7] <- Probably hit (sequential)
6. Key matches! Return
Problem: First access is random, causing cache miss.
If we have many collisions, we have many random accesses.
This mental model suggests optimizations:
- Reduce collisions (better hash function)
- Improve cache locality (better data layout)
- Prefetch entries (hide latency; see the sketch below)
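As an example of the third idea, issuing a prefetch for the slot after the one we're about to probe hides part of the miss latency. A hedged sketch against the same table layout as above, not the final FC-Redirect code:

flow_entry_t* flow_table_lookup_prefetch(flow_table_t *table, wwpn_t key) {
    uint32_t index = wwpn_hash(key) & table->mask;
    flow_entry_t *entry = &table->entries[index];

    while (entry->in_use && entry->key != key) {
        uint32_t next = (index + 1) & table->mask;
        // Start pulling the slot after 'next' into cache while we touch 'next'
        __builtin_prefetch(&table->entries[(next + 1) & table->mask], 0, 1);
        index = next;
        entry = &table->entries[index];
    }
    return entry;
}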
Step 4: Optimization
With understanding comes optimization. I prioritize based on impact:
Low-Hanging Fruit First
Start with changes that give big wins for little effort:
// Before: Computing strlen every iteration
for (int i = 0; i < strlen(str); i++) { // strlen() in loop condition!
process_char(str[i]);
}
// After: Compute once
size_t len = strlen(str);
for (size_t i = 0; i < len; i++) {
process_char(str[i]);
}
// Impact: 40% speedup for string processing
These “code smell” fixes often provide surprising improvements.
Algorithmic Improvements
Next, algorithmic changes:
// Before: O(n) lookup over a linked list
flow_entry_t* find_flow_linear(flow_list_t *list, wwpn_t key) {
    for (flow_entry_t *entry = list->head; entry != NULL; entry = entry->next) {
        if (entry->key == key) return entry;
    }
    return NULL;
}
// After: O(1) average-case lookup with a hash table
flow_entry_t* find_flow_hash(flow_table_t *table, wwpn_t key) {
    uint32_t hash = wwpn_hash(key);
    uint32_t index = hash & table->mask;
    flow_entry_t *entry = &table->entries[index];
    // Linear probing: stop at the matching key or an empty slot
    while (entry->in_use && entry->key != key) {
        index = (index + 1) & table->mask;
        entry = &table->entries[index];
    }
    return entry->in_use ? entry : NULL;
}
// Impact: 67x speedup on 12K flow table
Algorithmic improvements provide the biggest wins.
Microoptimizations
Finally, microoptimizations for the truly hot paths:
// Before: min with a branch (the ternary can compile to a conditional jump)
static inline uint32_t min(uint32_t a, uint32_t b) {
return (a < b) ? a : b;
}
// After: Branchless min
static inline uint32_t min_branchless(uint32_t a, uint32_t b) {
return a - ((a - b) & -(a > b));
}
// Before: Expensive division
uint32_t index = hash % table->size;
// After: Bitwise AND (requires power-of-2 size)
uint32_t index = hash & (table->size - 1);
// Before: Unpredictable branch
if (flow->type == TYPE_A) {
process_type_a(flow);
} else {
process_type_b(flow);
}
// After: Function pointer (if predictable)
flow->process_fn(flow);
These matter only in the innermost loops but can provide 10-20% improvements.
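The bitwise-AND trick only works when the table size is a power of two, so the allocator has to enforce that. A small sketch of rounding a requested capacity up (the helper name is mine, not from FC-Redirect):

// Round up to the next power of two (valid for 1 <= x <= 2^31).
static uint32_t next_pow2(uint32_t x) {
    x--;
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    return x + 1;
}

// table->size = next_pow2(requested_capacity);
// table->mask = table->size - 1;   // index = hash & table->mask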
Step 5: Validation
After optimization, I validate the improvement:
void validate_optimization(test_workload_t *workload,
performance_baseline_t *before) {
// Capture new performance
performance_baseline_t after = capture_baseline(workload);
// Compute improvements
double throughput_improvement =
((double)after.packets_per_second / before->packets_per_second - 1.0) * 100;
double latency_improvement =
(1.0 - (double)after.p99_latency_ns / before->p99_latency_ns) * 100;
double cpu_improvement =
(1.0 - after.cpu_utilization / before->cpu_utilization) * 100;
// Report
printf("Optimization Results:\n");
printf(" Throughput: %+.2f%%\n", throughput_improvement);
printf(" P99 Latency: %+.2f%%\n", latency_improvement);
printf(" CPU Usage: %+.2f%%\n", cpu_improvement);
// Regression test
assert(throughput_improvement >= 0); // Must not regress
assert(latency_improvement >= 0);
assert(verify_correctness()); // Must still be correct!
}
Optimizations must be validated. I’ve had “optimizations” that made things worse.
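A single run is rarely enough, because measurements are noisy. A hedged sketch of averaging several runs before comparing against the baseline (the run count is arbitrary, and a real comparison would also look at variance):

performance_baseline_t capture_averaged_baseline(test_workload_t *workload, int runs) {
    performance_baseline_t avg = {0};
    for (int i = 0; i < runs; i++) {
        performance_baseline_t b = capture_baseline(workload);
        avg.packets_per_second += b.packets_per_second;
        avg.p99_latency_ns     += b.p99_latency_ns;
        avg.cpu_utilization    += b.cpu_utilization;
    }
    avg.packets_per_second /= runs;
    avg.p99_latency_ns     /= runs;
    avg.cpu_utilization    /= runs;   // other fields omitted for brevity
    return avg;
}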
Common Patterns I’ve Learned
Through repeated optimization cycles, I’ve identified common patterns:
Pattern 1: Cache-Conscious Data Structures
Hot data should fit in cache:
// Bad: Large structure with hot and cold fields mixed
typedef struct flow_entry {
    wwpn_t key;             // Hot
    uint64_t packets;       // Hot
    uint64_t bytes;         // Hot
    char description[256];  // Cold - wastes cache lines!
    timestamp_t created;    // Cold
    metadata_t metadata;    // Cold
} flow_entry_t;
// Good: Separate hot and cold data
typedef struct flow_entry_hot {
    wwpn_t key;
    uint64_t packets;
    uint64_t bytes;
    uint32_t cold_index;    // Index into the cold-data array
} __attribute__((aligned(64))) flow_entry_hot_t; // One cache line
typedef struct flow_entry_cold {
    char description[256];
    timestamp_t created;
    metadata_t metadata;
} flow_entry_cold_t;
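A cheap way to keep the hot structure honest is a compile-time size check, so a new field can't silently push it past one cache line (C11 _Static_assert; adjust if your compiler predates it):

_Static_assert(sizeof(flow_entry_hot_t) <= 64,
               "flow_entry_hot_t must fit in a single cache line");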
Pattern 2: Batching and Amortization
Amortize expensive operations:
// Bad: Update storage on every packet
void process_packet_unbatched(packet_t *pkt) {
update_statistics(pkt);
write_to_storage(statistics); // Expensive I/O every time!
}
// Good: Batch storage updates
void process_packet_batched(packet_t *pkt) {
update_statistics(pkt);
if (++batch_count >= BATCH_SIZE || time_since_flush > TIMEOUT) {
flush_statistics_to_storage(); // Amortized cost
batch_count = 0;
}
}
Pattern 3: Lock-Free Where Possible
Avoid locks on hot paths:
// Bad: Lock for every counter increment
pthread_mutex_lock(&counter_lock);
counter++;
pthread_mutex_unlock(&counter_lock);
// Good: Atomic increment
atomic_fetch_add(&counter, 1);
// Better: Per-thread counters (no contention on the hot path)
// Each thread increments only its own, cache-line-padded slot.
struct padded_counter { uint64_t count; char pad[56]; };
static struct padded_counter thread_counters[MAX_THREADS];
// On the hot path, indexed by this thread's ID:
thread_counters[my_thread_id].count++; // No sharing, no locks
// Periodically aggregate
uint64_t get_total_count(void) {
    uint64_t total = 0;
    for (int i = 0; i < num_threads; i++) {
        total += thread_counters[i].count;
    }
    return total;
}
Tools I Use Daily
My optimization toolkit:
- perf: CPU profiling, hardware counters
- flamegraph.pl: Visual profiling
- valgrind --tool=cachegrind: Cache simulation
- gdb: Inspecting live behavior
- custom instrumentation: Application-specific metrics
- Brendan Gregg’s tools: Various system analysis scripts
Lessons Learned
After a year of optimization work:
- Profile first, always: My intuition about bottlenecks is wrong 60% of the time.
- Fix the biggest problem: Optimizing the second-biggest bottleneck when the first is 10x bigger is a waste.
- Measure the impact: Some “optimizations” made things worse. Always measure.
- Correctness trumps performance: Fast and wrong is useless.
- Document why: Future me needs to understand why code is written strangely for performance.
- Know when to stop: At some point, additional optimization has diminishing returns.
Conclusion
Performance optimization is a systematic process: measure, understand, optimize, validate. This workflow has served me well across diverse optimization challenges in FC-Redirect.
The key is rigor. No guessing, no premature optimization, no unvalidated changes. Just careful measurement, thoughtful analysis, and disciplined implementation.
As FC-Redirect continues to grow throughout 2013, this optimization workflow remains essential. Every time we add features or scale further, performance work is needed. Having a proven process makes the work tractable and effective.
If you’re working on performance-critical systems, I encourage you to develop your own systematic approach. The specific tools matter less than the discipline of measure-understand-optimize-validate.
Performance optimization is as much art as science, but systematic methodology turns it from guesswork into engineering.