Over the past year, I’ve spent countless hours optimizing FC-Redirect. Through trial and error, I’ve developed a systematic workflow that consistently produces results. Today I want to share this approach, along with the tools and techniques that make it effective.
The Three Commandments of Optimization
Before diving into specifics, three principles guide all my optimization work:
- Measure, don’t guess: Intuition is often wrong. Always profile before optimizing.
- Optimize the critical path: Focus on what matters; typically 90% of the time is spent in 10% of the code.
- Validate improvements: Measure after optimization to confirm the improvement is real.
These seem obvious, but I’ve wasted weeks ignoring them.
Step 1: Establish a Baseline
Before optimizing anything, I establish a comprehensive baseline:
Performance Metrics
typedef struct performance_baseline {
// Throughput
uint64_t packets_per_second;
uint64_t bytes_per_second;
// Latency
latency_histogram_t latency_dist;
uint64_t p50_latency_ns;
uint64_t p99_latency_ns;
uint64_t p999_latency_ns;
// Resource utilization
float cpu_utilization;
float memory_utilization;
uint64_t cache_misses;
uint64_t tlb_misses;
// System characteristics
uint64_t context_switches_per_sec;
uint64_t syscalls_per_sec;
} performance_baseline_t;
performance_baseline_t capture_baseline(test_workload_t *workload) {
performance_baseline_t baseline = {0};
// Run workload
start_perf_monitoring();
run_workload(workload, 60*SECONDS);
stop_perf_monitoring();
// Collect metrics
baseline.packets_per_second = get_packet_rate();
baseline.latency_dist = get_latency_histogram();
baseline.p50_latency_ns = histogram_percentile(&baseline.latency_dist, 50);
baseline.p99_latency_ns = histogram_percentile(&baseline.latency_dist, 99);
baseline.p999_latency_ns = histogram_percentile(&baseline.latency_dist, 99.9);
// CPU metrics
baseline.cpu_utilization = get_cpu_utilization();
baseline.cache_misses = get_perf_counter(PERF_COUNT_CACHE_MISSES);
baseline.tlb_misses = get_perf_counter(PERF_COUNT_TLB_MISSES);
return baseline;
}
This baseline becomes my reference point. Any optimization must improve these metrics.
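The latency histogram type and histogram_percentile() above are project helpers I haven't shown. As a minimal sketch of the idea, assuming fixed microsecond-wide buckets (the real FC-Redirect histogram may be laid out differently):

// Sketch only: bucket count and width are assumptions, not the real type.
typedef struct latency_histogram {
    uint64_t buckets[1024];   // bucket i counts samples in [i us, i+1 us)
    uint64_t total_samples;
} latency_histogram_t;

static uint64_t histogram_percentile(const latency_histogram_t *h, double pct) {
    uint64_t target = (uint64_t)(h->total_samples * (pct / 100.0));
    uint64_t seen = 0;
    for (int i = 0; i < 1024; i++) {
        seen += h->buckets[i];
        if (seen >= target)
            return (uint64_t)(i + 1) * 1000;   // upper bucket bound, in ns
    }
    return 1024ULL * 1000;                     // everything past the last bucket
}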
Workload Characterization
I create representative workloads that stress different aspects:
// Workload 1: Many small flows (latency-sensitive)
test_workload_t create_latency_workload() {
return (test_workload_t) {
.num_flows = 10000,
.flow_size_bytes = 1024, // 1KB flows
.arrival_rate = 50000, // 50K flows/sec
.duration_seconds = 60
};
}
// Workload 2: Few large flows (throughput-focused)
test_workload_t create_throughput_workload() {
return (test_workload_t) {
.num_flows = 100,
.flow_size_bytes = 100*1024*1024, // 100MB flows
.arrival_rate = 10,
.duration_seconds = 60
};
}
// Workload 3: Mixed (realistic)
test_workload_t create_mixed_workload() {
return (test_workload_t) {
.num_flows = 5000,
.flow_size_distribution = DIST_PARETO, // 80/20 rule
.arrival_pattern = PATTERN_BURSTY,
.duration_seconds = 60
};
}
Different workloads reveal different bottlenecks. The latency workload exposes lookup overhead, the throughput workload reveals bandwidth limits, and the mixed workload shows real-world behavior.
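For the mixed workload, DIST_PARETO is what produces the 80/20 behavior. The generator itself isn't shown above; a hedged sketch of drawing Pareto-distributed flow sizes by inverse-CDF sampling (the constants are illustrative, not the values FC-Redirect actually uses):

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

// Inverse-CDF sample from a Pareto distribution with scale min_bytes and
// shape alpha. A shape near 1.16 gives roughly 80/20 behavior.
static uint64_t pareto_flow_size(uint64_t min_bytes, double alpha) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   // u in (0, 1)
    return (uint64_t)((double)min_bytes / pow(u, 1.0 / alpha));
}

// Example: most flows land near 1KB, with a heavy tail reaching into the MBs.
// uint64_t size = pareto_flow_size(1024, 1.16);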
Step 2: Profiling
With a baseline established, I profile to find bottlenecks. I use multiple tools:
CPU Profiling with perf
For CPU-bound workloads, Linux perf is indispensable:
# Record profile
perf record -g -F 999 ./fc_redirect_test
# Analyze profile
perf report --stdio --no-children | head -50
This shows where CPU time is spent. A typical output:
47.32% [.] flow_table_lookup
12.18% [.] wwpn_hash
8.91% [.] update_flow_statistics
6.47% [.] process_packet
...
This immediately tells me that flow lookup is the bottleneck consuming 47% of CPU time.
Flamegraphs
For visual analysis, I generate flamegraphs:
perf record -g -F 999 ./fc_redirect_test
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > profile.svg
Flamegraphs make it obvious where time is spent and how functions call each other. Wide bars = hot paths.
Hardware Performance Counters
For deeper analysis, I use hardware counters:
# Cache miss analysis
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
./fc_redirect_test
# Branch prediction analysis
perf stat -e branches,branch-misses \
./fc_redirect_test
# Memory bandwidth
perf stat -e cpu/event=0xb7,umask=0x1/ \
./fc_redirect_test
This reveals microarchitectural issues like cache misses or branch mispredictions.
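The get_perf_counter() calls in the baseline code do this kind of measurement programmatically. One way to read a hardware counter from inside the process is the Linux perf_event_open syscall; a minimal sketch with error handling trimmed and the counter choice hard-coded:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

// Count cache misses in user space while the measured code runs.
static uint64_t measure_cache_misses(void (*measured_fn)(void)) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   // this process, any CPU
    if (fd < 0) return 0;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    measured_fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    close(fd);
    return count;
}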
Custom Instrumentation
For application-specific insights, I add custom timing:
#define TIMING_ENABLED 1
typedef struct timing_info {
const char *name;
uint64_t total_time;
uint64_t call_count;
} timing_info_t;
static timing_info_t timings[100];
static int num_timings = 0;
#define TIME_SECTION_START(name) \
uint64_t _time_start_##name = rdtsc(); \
int _time_index_##name = get_or_create_timing(#name);
#define TIME_SECTION_END(name) \
uint64_t _time_end_##name = rdtsc(); \
timings[_time_index_##name].total_time += \
(_time_end_##name - _time_start_##name); \
timings[_time_index_##name].call_count++;
// Usage
void process_flow(flow_entry_t *flow) {
TIME_SECTION_START(lookup);
flow_entry_t *entry = flow_table_lookup(flow->key);
TIME_SECTION_END(lookup);
TIME_SECTION_START(update);
update_flow_statistics(entry);
TIME_SECTION_END(update);
TIME_SECTION_START(forward);
forward_packet(entry);
TIME_SECTION_END(forward);
}
// Report timing
void print_timing_report() {
for (int i = 0; i < num_timings; i++) {
timing_info_t *t = &timings[i];
double avg_cycles = (double)t->total_time / t->call_count;
printf("%20s: %10lu calls, %8.2f cycles/call\n",
t->name, t->call_count, avg_cycles);
}
}
This gives me precise timing for specific code sections.
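The macros assume two helpers I haven't shown: rdtsc() and get_or_create_timing(). A minimal sketch of both, assuming x86-64 and GCC/Clang (the registry lookup is not thread-safe and does no bounds checking; illustration only):

#include <string.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// Find an existing timing slot by name, or register a new one.
static int get_or_create_timing(const char *name) {
    for (int i = 0; i < num_timings; i++) {
        if (strcmp(timings[i].name, name) == 0)
            return i;
    }
    timings[num_timings].name = name;
    return num_timings++;
}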
Step 3: Understanding the Bottleneck
Profiling shows what is slow. The next step is understanding why.
Analyzing Hot Functions
When perf shows a hot function, I examine it carefully:
// Hot function identified by profiling
flow_entry_t* flow_table_lookup(flow_table_t *table, wwpn_t key) {
// Hypothesis 1: Hash function is expensive
uint32_t hash = wwpn_hash(key); // Measured: 30 cycles
// Hypothesis 2: Hash table collisions
uint32_t index = hash & table->mask;
// Hypothesis 3: Cache misses on lookup
flow_entry_t *entry = &table->entries[index]; // Cache miss here!
// Hypothesis 4: Linear probing is expensive
while (entry->in_use && entry->key != key) {
index = (index + 1) & table->mask;
entry = &table->entries[index]; // Another cache miss
}
return entry;
}
By annotating the function with hypotheses and measurements, I identify that cache misses during linear probing are the real bottleneck.
Cache Miss Analysis
Cache misses are often the culprit. I use perf to confirm:
perf record -e cache-misses -g ./fc_redirect_test
perf report --stdio
This shows which functions cause cache misses. For our lookup function:
68.4% flow_table_lookup
|--52.1%-- linear probing
|--31.2%-- entry access
'--16.7%-- hash computation
Most cache misses happen during linear probing, confirming the hypothesis.
Building a Mental Model
I visualize memory access patterns:
Hash table layout in memory:
[Entry 0][Entry 1][Entry 2]...[Entry N]
Linear probing for key with hash=5:
1. Access entry[5] <- Cache miss (random access)
2. Key doesn't match
3. Access entry[6] <- Might hit (sequential)
4. Key doesn't match
5. Access entry[7] <- Probably hit (sequential)
6. Key matches! Return
Problem: First access is random, causing cache miss.
If we have many collisions, we have many random accesses.
This mental model suggests optimizations:
- Reduce collisions (better hash function)
- Improve cache locality (better data layout)
- Prefetch entries (hide latency; see the sketch below)
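As an example of the third idea, issuing a prefetch for the slot after the one we're about to probe hides part of the miss latency. A hedged sketch against the same table layout as above, not the final FC-Redirect code:

flow_entry_t* flow_table_lookup_prefetch(flow_table_t *table, wwpn_t key) {
    uint32_t index = wwpn_hash(key) & table->mask;
    flow_entry_t *entry = &table->entries[index];

    while (entry->in_use && entry->key != key) {
        uint32_t next = (index + 1) & table->mask;
        // Start pulling the slot after 'next' into cache while we touch 'next'
        __builtin_prefetch(&table->entries[(next + 1) & table->mask], 0, 1);
        index = next;
        entry = &table->entries[index];
    }
    return entry;
}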
Step 4: Optimization
With understanding comes optimization. I prioritize based on impact:
Low-Hanging Fruit First
Start with changes that give big wins for little effort:
// Before: Computing strlen every iteration
for (int i = 0; i < strlen(str); i++) { // strlen() in loop condition!
process_char(str[i]);
}
// After: Compute once
size_t len = strlen(str);
for (size_t i = 0; i < len; i++) {
process_char(str[i]);
}
// Impact: 40% speedup for string processing
These “code smell” fixes often provide surprising improvements.
Algorithmic Improvements
Next, algorithmic changes:
// Before: O(n) lookup over a linked list
flow_entry_t* find_flow_linear(flow_list_t *list, wwpn_t key) {
    for (flow_entry_t *entry = list->head; entry != NULL; entry = entry->next) {
        if (entry->key == key) return entry;
    }
    return NULL;
}
// After: O(1) average-case lookup with a hash table
flow_entry_t* find_flow_hash(flow_table_t *table, wwpn_t key) {
    uint32_t hash = wwpn_hash(key);
    uint32_t index = hash & table->mask;
    flow_entry_t *entry = &table->entries[index];
    // Linear probing: stop at the matching key or an empty slot
    while (entry->in_use && entry->key != key) {
        index = (index + 1) & table->mask;
        entry = &table->entries[index];
    }
    return entry->in_use ? entry : NULL;
}
// Impact: 67x speedup on 12K flow table
Algorithmic improvements provide the biggest wins.
Microoptimizations
Finally, microoptimizations for the truly hot paths:
// Before: min with a branch (the ternary can compile to a conditional jump)
static inline uint32_t min(uint32_t a, uint32_t b) {
return (a < b) ? a : b;
}
// After: Branchless min
static inline uint32_t min_branchless(uint32_t a, uint32_t b) {
return a - ((a - b) & -(a > b));
}
// Before: Expensive division
uint32_t index = hash % table->size;
// After: Bitwise AND (requires power-of-2 size)
uint32_t index = hash & (table->size - 1);
// Before: Unpredictable branch
if (flow->type == TYPE_A) {
process_type_a(flow);
} else {
process_type_b(flow);
}
// After: Function pointer (if predictable)
flow->process_fn(flow);
These matter only in the innermost loops but can provide 10-20% improvements.
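The bitwise-AND trick only works when the table size is a power of two, so the allocator has to enforce that. A small sketch of rounding a requested capacity up (the helper name is mine, not from FC-Redirect):

// Round up to the next power of two (valid for 1 <= x <= 2^31).
static uint32_t next_pow2(uint32_t x) {
    x--;
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    return x + 1;
}

// table->size = next_pow2(requested_capacity);
// table->mask = table->size - 1;   // index = hash & table->mask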
Step 5: Validation
After optimization, I validate the improvement:
void validate_optimization(test_workload_t *workload,
performance_baseline_t *before) {
// Capture new performance
performance_baseline_t after = capture_baseline(workload);
// Compute improvements
double throughput_improvement =
((double)after.packets_per_second / before->packets_per_second - 1.0) * 100;
double latency_improvement =
(1.0 - (double)after.p99_latency_ns / before->p99_latency_ns) * 100;
double cpu_improvement =
(1.0 - after.cpu_utilization / before->cpu_utilization) * 100;
// Report
printf("Optimization Results:\n");
printf(" Throughput: %+.2f%%\n", throughput_improvement);
printf(" P99 Latency: %+.2f%%\n", latency_improvement);
printf(" CPU Usage: %+.2f%%\n", cpu_improvement);
// Regression test
assert(throughput_improvement >= 0); // Must not regress
assert(latency_improvement >= 0);
assert(verify_correctness()); // Must still be correct!
}
Optimizations must be validated. I’ve had “optimizations” that made things worse.
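A single run is rarely enough, because measurements are noisy. A hedged sketch of averaging several runs before comparing against the baseline (the run count is arbitrary, and a real comparison would also look at variance):

performance_baseline_t capture_averaged_baseline(test_workload_t *workload, int runs) {
    performance_baseline_t avg = {0};
    for (int i = 0; i < runs; i++) {
        performance_baseline_t b = capture_baseline(workload);
        avg.packets_per_second += b.packets_per_second;
        avg.p99_latency_ns     += b.p99_latency_ns;
        avg.cpu_utilization    += b.cpu_utilization;
    }
    avg.packets_per_second /= runs;
    avg.p99_latency_ns     /= runs;
    avg.cpu_utilization    /= runs;   // other fields omitted for brevity
    return avg;
}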
Common Patterns I’ve Learned
Through repeated optimization cycles, I’ve identified common patterns:
Pattern 1: Cache-Conscious Data Structures
Hot data should fit in cache:
// Bad: Large structure with hot and cold fields mixed
typedef struct flow_entry {
    wwpn_t key;             // Hot
    uint64_t packets;       // Hot
    uint64_t bytes;         // Hot
    char description[256];  // Cold - wastes cache lines!
    timestamp_t created;    // Cold
    metadata_t metadata;    // Cold
} flow_entry_t;
// Good: Separate hot and cold data
typedef struct flow_entry_hot {
    wwpn_t key;
    uint64_t packets;
    uint64_t bytes;
    uint32_t cold_index;    // Index into the cold-data array
} __attribute__((aligned(64))) flow_entry_hot_t; // One cache line
typedef struct flow_entry_cold {
    char description[256];
    timestamp_t created;
    metadata_t metadata;
} flow_entry_cold_t;
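A cheap way to keep the hot structure honest is a compile-time size check, so a new field can't silently push it past one cache line (C11 _Static_assert; adjust if your compiler predates it):

_Static_assert(sizeof(flow_entry_hot_t) <= 64,
               "flow_entry_hot_t must fit in a single cache line");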
Pattern 2: Batching and Amortization
Amortize expensive operations:
// Bad: Update storage on every packet
void process_packet_unbatched(packet_t *pkt) {
update_statistics(pkt);
write_to_storage(statistics); // Expensive I/O every time!
}
// Good: Batch storage updates
void process_packet_batched(packet_t *pkt) {
update_statistics(pkt);
if (++batch_count >= BATCH_SIZE || time_since_flush > TIMEOUT) {
flush_statistics_to_storage(); // Amortized cost
batch_count = 0;
}
}
Pattern 3: Lock-Free Where Possible
Avoid locks on hot paths:
// Bad: Lock for every counter increment
pthread_mutex_lock(&counter_lock);
counter++;
pthread_mutex_unlock(&counter_lock);
// Good: Atomic increment
atomic_fetch_add(&counter, 1);
// Better: Per-thread counters (no contention on the hot path)
// Each thread increments only its own, cache-line-padded slot.
struct padded_counter { uint64_t count; char pad[56]; };
static struct padded_counter thread_counters[MAX_THREADS];
// On the hot path, indexed by this thread's ID:
thread_counters[my_thread_id].count++; // No sharing, no locks
// Periodically aggregate
uint64_t get_total_count(void) {
    uint64_t total = 0;
    for (int i = 0; i < num_threads; i++) {
        total += thread_counters[i].count;
    }
    return total;
}
Tools I Use Daily
My optimization toolkit:
- perf: CPU profiling, hardware counters
- flamegraph.pl: Visual profiling
- valgrind --tool=cachegrind: Cache simulation
- gdb: Inspecting live behavior
- custom instrumentation: Application-specific metrics
- Brendan Gregg’s tools: Various system analysis scripts
Lessons Learned
After a year of optimization work:
- Profile first, always: My intuition about bottlenecks is wrong 60% of the time.
- Fix the biggest problem: Optimizing the second-biggest bottleneck when the first is 10x bigger is a waste.
- Measure the impact: Some “optimizations” made things worse. Always measure.
- Correctness trumps performance: Fast and wrong is useless.
- Document why: Future me needs to understand why code is written strangely for performance.
- Know when to stop: At some point, additional optimization has diminishing returns.
Conclusion
Performance optimization is a systematic process: measure, understand, optimize, validate. This workflow has served me well across diverse optimization challenges in FC-Redirect.
The key is rigor. No guessing, no premature optimization, no unvalidated changes. Just careful measurement, thoughtful analysis, and disciplined implementation.
As FC-Redirect continues to grow throughout 2013, this optimization workflow remains essential. Every time we add features or scale further, performance work is needed. Having a proven process makes the work tractable and effective.
If you’re working on performance-critical systems, I encourage you to develop your own systematic approach. The specific tools matter less than the discipline of measure-understand-optimize-validate.
Performance optimization is as much art as science, but systematic methodology turns it from guesswork into engineering.