Latency is a feature. In a world where users expect instant responses, every millisecond matters. This post chronicles our journey optimizing a high-traffic API platform from 400ms to 50ms p95 latency—an 8x improvement. The techniques shared here apply to any high-scale distributed system.

Understanding the Baseline

Before optimizing, we needed to understand where time was being spent. Our initial architecture looked like this:

User Request → Load Balancer → API Gateway → Service → Database
                  ~10ms           ~20ms       ~350ms     ~20ms

The 350ms in the service layer was our primary target.

Measuring Everything

You can’t improve what you can’t measure. We instrumented the entire request path:

@Component
public class LatencyMonitor {

    private static final Logger logger = LoggerFactory.getLogger(LatencyMonitor.class);

    private final MeterRegistry metrics;

    public LatencyMonitor(MeterRegistry metrics) {
        this.metrics = metrics;
    }

    public <T> T measureLatency(String operation, Supplier<T> supplier) {
        Timer.Sample sample = Timer.start(metrics);
        long startNanos = System.nanoTime();

        try {
            T result = supplier.get();
            long durationNanos = System.nanoTime() - startNanos;

            // Record to metrics
            sample.stop(metrics.timer("latency",
                "operation", operation,
                "status", "success"
            ));

            // Log slow operations
            if (durationNanos > 100_000_000) { // 100ms
                logger.warn("Slow operation: {} took {}ms",
                    operation,
                    durationNanos / 1_000_000
                );
            }

            return result;

        } catch (Exception e) {
            sample.stop(metrics.timer("latency",
                "operation", operation,
                "status", "error"
            ));
            throw e;
        }
    }
}

// Usage
@Service
public class EventService {

    // Collaborators referenced below (their types are elided in this snippet)
    private final LatencyMonitor latencyMonitor;
    private final Database database;
    private final Enricher enricher;

    public Event getEvent(String eventId) {
        return latencyMonitor.measureLatency("getEvent", () -> {
            Event event = latencyMonitor.measureLatency("db.query",
                () -> database.query(eventId)
            );

            latencyMonitor.measureLatency("enrichment",
                () -> enricher.enrich(event)
            );

            return event;
        });
    }
}

This revealed our top latency contributors:

  1. Database queries: 180ms
  2. External service calls: 120ms
  3. Data serialization: 30ms
  4. Business logic: 20ms

Optimization 1: Database Query Optimization

Problem: N+1 Query Pattern

// BEFORE: N+1 queries (slow!)
@Service
public class EventService {
    public List<EnrichedEvent> getEvents(List<String> eventIds) {
        List<EnrichedEvent> results = new ArrayList<>();

        for (String eventId : eventIds) {
            Event event = eventRepository.findById(eventId);  // 1 query
            User user = userRepository.findById(event.getUserId());  // N queries!
            results.add(new EnrichedEvent(event, user));
        }

        return results;
    }
}

Solution: Batch Loading

// AFTER: 2 queries total
@Service
public class OptimizedEventService {

    public List<EnrichedEvent> getEvents(List<String> eventIds) {
        // Single query for all events
        List<Event> events = eventRepository.findAllById(eventIds);

        // Single query for all users
        Set<String> userIds = events.stream()
            .map(Event::getUserId)
            .collect(Collectors.toSet());
        Map<String, User> users = userRepository.findAllById(userIds)
            .stream()
            .collect(Collectors.toMap(User::getId, Function.identity()));

        // Combine in memory
        return events.stream()
            .map(event -> new EnrichedEvent(event, users.get(event.getUserId())))
            .collect(Collectors.toList());
    }
}

Result: Database query time reduced from 180ms → 40ms

Adding Strategic Indexes

-- BEFORE: Full table scan
SELECT * FROM events
WHERE user_id = ? AND created_at > ?
ORDER BY created_at DESC
LIMIT 100;

-- Add composite index
CREATE INDEX idx_events_user_created
ON events(user_id, created_at DESC);

-- AFTER: Index scan
EXPLAIN SELECT * FROM events
WHERE user_id = ? AND created_at > ?
ORDER BY created_at DESC
LIMIT 100;

-- Query plan now shows:
-- Index Scan using idx_events_user_created on events (cost=0.43..123.45 rows=100)

Optimization 2: Caching Strategy

Distributed Caching with Redis

@Service
public class CachedEventService {

    private final RedisTemplate<String, Event> redis;
    private final EventRepository repository;
    private final MeterRegistry metrics;
    private final Duration cacheTTL = Duration.ofMinutes(15);

    public Event getEvent(String eventId) {
        // Try cache first
        String cacheKey = "event:" + eventId;
        Event cached = redis.opsForValue().get(cacheKey);

        if (cached != null) {
            metrics.counter("cache.hit", "entity", "event").increment();
            return cached;
        }

        metrics.counter("cache.miss", "entity", "event").increment();

        // Load from database
        Event event = repository.findById(eventId)
            .orElseThrow(() -> new EventNotFoundException(eventId));

        // Populate cache
        redis.opsForValue().set(cacheKey, event, cacheTTL);

        return event;
    }

    public void updateEvent(Event event) {
        // Update database
        repository.save(event);

        // Invalidate cache
        redis.delete("event:" + event.getId());

        // Also invalidate any derived caches
        redis.delete("user:events:" + event.getUserId());
    }
}
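
One refinement worth considering (our addition, not shown in the service above): add random jitter to the TTL so hot keys written in the same burst don't all expire at the same moment and stampede the database.

// Hypothetical tweak to the set() call above: spread expirations out by
// adding up to 60 seconds of random jitter to the base TTL
Duration jitteredTTL = cacheTTL.plus(
    Duration.ofSeconds(ThreadLocalRandom.current().nextLong(60)));
redis.opsForValue().set(cacheKey, event, jitteredTTL);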

Local Caching for Hot Data

type LocalCache struct {
    cache *lru.Cache
    ttl   time.Duration
    mu    sync.RWMutex
}

type cacheEntry struct {
    value      interface{}
    expiration time.Time
}

func (c *LocalCache) Get(key string) (interface{}, bool) {
    // Take the write lock: lru.Get updates recency order, and we may
    // remove an expired entry, so a read lock is not sufficient here
    c.mu.Lock()
    defer c.mu.Unlock()

    if entry, ok := c.cache.Get(key); ok {
        e := entry.(*cacheEntry)
        if time.Now().Before(e.expiration) {
            return e.value, true
        }
        // Expired: evict so later lookups fall through to the next tier
        c.cache.Remove(key)
    }

    return nil, false
}

func (c *LocalCache) Set(key string, value interface{}) {
    c.mu.Lock()
    defer c.mu.Unlock()

    c.cache.Add(key, &cacheEntry{
        value:      value,
        expiration: time.Now().Add(c.ttl),
    })
}

// Usage: two-tier caching
func (s *Service) GetUser(userID string) (*User, error) {
    // L1: Local cache (microseconds)
    if user, ok := s.localCache.Get(userID); ok {
        return user.(*User), nil
    }

    // L2: Redis cache (milliseconds)
    if user, err := s.redisCache.Get(userID); err == nil {
        s.localCache.Set(userID, user)
        return user, nil
    }

    // L3: Database (tens of milliseconds)
    user, err := s.db.GetUser(userID)
    if err != nil {
        return nil, err
    }

    s.redisCache.Set(userID, user)
    s.localCache.Set(userID, user)

    return user, nil
}

Result: Cache hit rate of 85%, reducing average data access from 40ms → 5ms

Optimization 3: Parallel Processing

Before: Sequential Processing

def process_event(event_id):
    # Sequential - takes sum of all operations
    event = db.get_event(event_id)  # 20ms
    user = db.get_user(event.user_id)  # 20ms
    metadata = external_service.get_metadata(event.metadata_id)  # 60ms

    enriched = enrich(event, user, metadata)  # 10ms
    return enriched
    # Total: 110ms

After: Parallel Processing

import asyncio

async def process_event_parallel(event_id):
    # Parallel - independent operations overlap, so each stage costs
    # only as much as its slowest call

    # The event must be loaded first; user and metadata depend on it
    event = await async_db.get_event(event_id)

    # Fetch user and metadata concurrently
    user, metadata = await asyncio.gather(
        async_db.get_user(event.user_id),
        async_external_service.get_metadata(event.metadata_id),
    )

    # CPU-bound work
    enriched = enrich(event, user, metadata)
    return enriched
    # Total: 20ms + max(20ms, 60ms) + 10ms = 90ms

For Java, use CompletableFuture:

public CompletableFuture<EnrichedEvent> processEventParallel(String eventId) {

    return eventRepository.findByIdAsync(eventId)
        .thenCompose(event -> {
            // Launch parallel operations
            CompletableFuture<User> userFuture =
                userRepository.findByIdAsync(event.getUserId());

            CompletableFuture<Metadata> metadataFuture =
                externalService.getMetadataAsync(event.getMetadataId());

            // Combine results
            return CompletableFuture.allOf(userFuture, metadataFuture)
                .thenApply(v -> enricher.enrich(
                    event,
                    userFuture.join(),
                    metadataFuture.join()
                ));
        });
}
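
One guard worth adding (our suggestion, not part of the original pipeline): bound the external call so a single slow dependency can't erase the win. A minimal sketch using CompletableFuture.orTimeout from Java 9, where Metadata.empty() is a hypothetical fallback:

CompletableFuture<Metadata> metadataFuture =
    externalService.getMetadataAsync(event.getMetadataId())
        // Fail fast after 200ms instead of letting p95 balloon
        .orTimeout(200, TimeUnit.MILLISECONDS)
        // Degrade gracefully with an empty metadata object
        .exceptionally(ex -> Metadata.empty());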

Result: External service integration time reduced from 120ms → 60ms

Optimization 4: Serialization Optimization

Problem: Jackson Default Serialization

// BEFORE: Reflection-based serialization
@Data
public class Event {
    private String id;
    private String userId;
    private String type;
    private Map<String, Object> attributes;
    private Instant timestamp;
}

ObjectMapper mapper = new ObjectMapper();
String json = mapper.writeValueAsString(event);  // 15ms for complex objects

Solution: Optimized Serialization

// Use afterburner module for bytecode generation
ObjectMapper mapper = new ObjectMapper()
    .registerModule(new AfterburnerModule())
    // Don't serialize nulls
    .setSerializationInclusion(JsonInclude.Include.NON_NULL)
    // Use optimized date serialization
    .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);

// For ultra-hot paths, use manual serialization
public class FastEventSerializer {

    private static final byte[] ID_PREFIX = "\"id\":\"".getBytes(StandardCharsets.UTF_8);
    private static final byte[] USER_ID_PREFIX = "\",\"userId\":\"".getBytes(StandardCharsets.UTF_8);
    // ... other fields

    // NOTE: assumes values contain no characters that need JSON escaping;
    // a real implementation must escape or validate them
    public byte[] serialize(Event event) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(512);

        try {
            baos.write('{');
            baos.write(ID_PREFIX);
            baos.write(event.getId().getBytes(StandardCharsets.UTF_8));
            baos.write(USER_ID_PREFIX);
            baos.write(event.getUserId().getBytes(StandardCharsets.UTF_8));
            // ... serialize other fields
            baos.write('"');  // close the last string value
            baos.write('}');

        } catch (IOException e) {
            throw new RuntimeException(e);
        }

        return baos.toByteArray();
    }
}

For even better performance, use protobuf:

syntax = "proto3";

message Event {
    string id = 1;
    string user_id = 2;
    string type = 3;
    map<string, string> attributes = 4;
    int64 timestamp = 5;
}

Protobuf serialization is typically 3-5x faster. Building and serializing with the generated class:

Event event = Event.newBuilder()
    .setId(id)
    .setUserId(userId)
    .setType(type)
    .setTimestamp(System.currentTimeMillis())
    .build();

byte[] bytes = event.toByteArray();  // 3ms

Result: Serialization time reduced from 30ms → 8ms

Optimization 5: Connection Pooling

// BEFORE: Creating a new connection per request
public Event getEvent(String id) {
    try (Connection conn = DriverManager.getConnection(url, user, password)) {
        // ... run the query - connection establishment alone adds 20-50ms!
        return queryEvent(conn, id);  // hypothetical helper; query logic elided
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
}

// AFTER: Connection pooling
@Configuration
public class DataSourceConfig {

    @Bean
    public DataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setUsername(username);
        config.setPassword(password);

        // Tune pool size based on load
        config.setMaximumPoolSize(50);
        config.setMinimumIdle(10);

        // Timeouts (in milliseconds)
        config.setConnectionTimeout(5000);    // max wait for a connection from the pool
        config.setIdleTimeout(600000);        // 10 minutes
        config.setMaxLifetime(1800000);       // 30 minutes

        // Performance tuning
        config.addDataSourceProperty("cachePrepStmts", "true");
        config.addDataSourceProperty("prepStmtCacheSize", "250");
        config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");

        return new HikariDataSource(config);
    }
}

Result: Database connection overhead reduced from ~25ms → ~0ms (amortized)
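
To confirm the pool is actually sized correctly under load, watch the pool's own metrics rather than guessing. A small sketch, assuming HikariCP's optional Micrometer integration is on the classpath:

// Publishes hikaricp.connections.* gauges and acquire timings to the same
// registry as the request metrics; sustained waits for a connection mean
// the pool size (or the queries) need attention
config.setMetricsTrackerFactory(new MicrometerMetricsTrackerFactory(meterRegistry));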

Optimization 6: Reducing Payload Size

// BEFORE: Sending everything
@GetMapping("/events/{id}")
public Event getEvent(@PathVariable String id) {
    Event event = eventService.get(id);
    return event;  // 50KB response
}

// AFTER: Field filtering
@GetMapping("/events/{id}")
public EventDTO getEvent(
        @PathVariable String id,
        @RequestParam(required = false) Set<String> fields) {

    Event event = eventService.get(id);

    if (fields == null || fields.isEmpty()) {
        return eventMapper.toDTO(event);
    }

    // Return only requested fields
    return eventMapper.toDTO(event, fields);  // 5KB response
}

// Client specifies fields
// GET /events/123?fields=id,type,timestamp

Smaller payloads = faster network transfer and serialization.
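
The eventMapper.toDTO(event, fields) call above carries the interesting logic. Here is a minimal sketch of one way to implement field filtering with Jackson's property filters; the "fieldFilter" name and the EventDTO shape are our assumptions:

@JsonFilter("fieldFilter")
public class EventDTO {
    public String id;
    public String type;
    public Instant timestamp;
    // ... other fields
}

// Serialize only the requested fields; everything else is dropped
public String toJson(EventDTO dto, Set<String> fields) throws JsonProcessingException {
    FilterProvider filters = new SimpleFilterProvider()
        .addFilter("fieldFilter", SimpleBeanPropertyFilter.filterOutAllExcept(fields));
    return objectMapper.writer(filters).writeValueAsString(dto);
}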

Optimization 7: Algorithm Improvements

Sometimes the biggest wins come from algorithmic changes:

# BEFORE: O(n²) algorithm
def find_related_events(event_id, all_events):
    related = []
    target = get_event(event_id)

    for event in all_events:
        if is_related(target, event):  # Expensive comparison
            related.append(event)

    return related
    # For 10,000 events: ~5 seconds

# AFTER: O(n) to build the index, near-constant time per lookup
from collections import defaultdict

class EventIndex:
    def __init__(self):
        self.by_user = defaultdict(list)
        self.by_type = defaultdict(list)
        self.by_tag = defaultdict(set)

    def index_event(self, event):
        self.by_user[event.user_id].append(event.id)
        self.by_type[event.type].append(event.id)
        for tag in event.tags:
            self.by_tag[tag].add(event.id)

    def find_related(self, event_id):
        event = get_event(event_id)

        # Use indexes to find candidates
        candidates = set()
        candidates.update(self.by_user[event.user_id])
        candidates.update(self.by_type[event.type])

        for tag in event.tags:
            candidates.update(self.by_tag[tag])

        candidates.discard(event_id)
        # If exact matching matters, verify with is_related() over this
        # much smaller candidate set instead of over all events
        return list(candidates)
        # For 10,000 events: ~50ms

Monitoring the Improvements

Track latency improvements with percentile metrics:

@Component
public class LatencyTracker {

    private final MeterRegistry registry;

    public LatencyTracker(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordLatency(String endpoint, long durationMs) {
        // Percentiles must be configured up front: Micrometer computes
        // them from a histogram, not from stored raw samples
        Timer.builder("http.request.duration")
            .tag("endpoint", endpoint)
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry)
            .record(durationMs, TimeUnit.MILLISECONDS);
    }

    // Query percentiles from the timer's histogram snapshot
    public LatencyStats getStats(String endpoint) {
        Timer timer = registry.find("http.request.duration")
            .tag("endpoint", endpoint)
            .timer();

        ValueAtPercentile[] percentiles = timer.takeSnapshot().percentileValues();

        return new LatencyStats(
            timer.mean(TimeUnit.MILLISECONDS),
            percentiles[0].value(TimeUnit.MILLISECONDS),  // p50 (median)
            percentiles[1].value(TimeUnit.MILLISECONDS),  // p95
            percentiles[2].value(TimeUnit.MILLISECONDS),  // p99
            timer.max(TimeUnit.MILLISECONDS)
        );
    }
}

Results Summary

Optimization          Before    After    Improvement
Database queries      180ms     40ms     4.5x
Caching layer         --        35ms     New
External services     120ms     60ms     2x
Serialization         30ms      8ms      3.75x
Connection pooling    25ms      ~0ms     --
Total p95             400ms     50ms     8x

Key Takeaways

  1. Measure first: Instrument everything before optimizing
  2. Fix the biggest problems: Focus on operations taking >10% of total time
  3. Database is often the bottleneck: Optimize queries, add indexes, batch operations
  4. Caching is powerful: But manage invalidation carefully
  5. Parallelize I/O: Issue independent calls concurrently instead of waiting on each in turn
  6. Reduce data transfer: Smaller payloads are faster
  7. Monitor continuously: Latency degrades over time without vigilance

Latency optimization is an ongoing journey, not a one-time project. Build monitoring, measurement, and optimization into your development culture.