Latency matters. Every extra 100 milliseconds of response time can cost you users, revenue, and competitive advantage. This is the story of how we systematically reduced our API response time from 400ms to 50ms—an 8x improvement that transformed our user experience.

Starting Point: Understanding the Problem

When users started complaining about slow page loads, we knew we had a performance problem. But knowing you’re slow and knowing why you’re slow are two very different things.

Our initial p95 latency was 400ms. That means 95% of requests took 400ms or less—and the remaining 5% were even slower. For a web application, 400ms feels sluggish. Users notice delays over 100ms.
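
As a toy illustration of what that number means (invented samples, not our production data), the nearest-rank p95 is just the value 95% of the way up the sorted latencies:

// Toy example: nearest-rank p95 over 20 invented latency samples (milliseconds)
double[] latenciesMs = {42, 45, 51, 58, 63, 70, 88, 95, 120, 140,
                        150, 170, 190, 210, 240, 260, 300, 340, 380, 400};
java.util.Arrays.sort(latenciesMs);
int rank = (int) Math.ceil(0.95 * latenciesMs.length);  // 1-based rank: the 19th of 20
double p95 = latenciesMs[rank - 1];                     // 380 here: 95% of samples are at or below it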

The first step wasn’t writing code—it was measurement.

Deep Instrumentation

We instrumented every layer of the stack to understand where time was being spent:

@Component
public class LatencyProfiler {
    private static final Logger logger = LoggerFactory.getLogger(LatencyProfiler.class);

    private final MeterRegistry metrics;

    public LatencyProfiler(MeterRegistry metrics) {
        this.metrics = metrics;
    }

    public <T> T measureOperation(String operation, Supplier<T> supplier) {
        long startNanos = System.nanoTime();

        try {
            T result = supplier.get();
            recordSuccess(operation, System.nanoTime() - startNanos);
            return result;

        } catch (Exception e) {
            recordFailure(operation, System.nanoTime() - startNanos);
            throw e;
        }
    }

    private void recordSuccess(String operation, long durationNanos) {
        double durationMs = durationNanos / 1_000_000.0;

        metrics.timer("latency",
            "operation", operation,
            "status", "success"
        ).record(Duration.ofNanos(durationNanos));

        // Alert on slow operations
        if (durationMs > 100) {
            logger.warn("Slow operation: {} took {}ms", operation, durationMs);
        }
    }

    private void recordFailure(String operation, long durationNanos) {
        metrics.timer("latency",
            "operation", operation,
            "status", "failure"
        ).record(Duration.ofNanos(durationNanos));
    }
}

This instrumentation revealed our latency breakdown:

  • Database queries: 180ms (45% of total)
  • External service calls: 120ms (30%)
  • Serialization/deserialization: 30ms (7.5%)
  • Connection overhead: 25ms (6.25%)
  • Application logic: 45ms (11.25%)

Armed with data, we knew where to focus.

Phase 1: Eliminating the N+1 Query Problem

The biggest culprit was database queries. Further investigation revealed the classic N+1 query problem.

Our code loaded the orders, then made separate database queries for each order's user and items:

// The problematic code
public List<OrderDTO> getOrders(List<String> orderIds) {
    List<Order> orders = orderRepository.findAllById(orderIds);  // 1 query for the orders
    List<OrderDTO> result = new ArrayList<>();

    for (Order order : orders) {
        User user = userRepository.findById(order.getUserId());               // N queries
        List<OrderItem> items = itemRepository.findByOrderId(order.getId());  // N queries

        result.add(new OrderDTO(order, user, items));
    }

    return result;  // Total: 1 + 2N database queries!
}

For 10 orders, this made 21 database queries. For 100 orders, 201 queries. Each round-trip to the database added 20-30ms of latency. This approach simply doesn’t scale.

The fix was batch loading—fetch everything in three queries instead of hundreds:

public List<OrderDTO> getOrders(List<String> orderIds) {
    // One query for all orders
    List<Order> orders = orderRepository.findAllById(orderIds);

    // Extract all user IDs
    Set<String> userIds = orders.stream()
        .map(Order::getUserId)
        .collect(Collectors.toSet());

    // One query for all users
    Map<String, User> users = userRepository.findAllById(userIds)
        .stream()
        .collect(Collectors.toMap(User::getId, Function.identity()));

    // One query for all items
    Map<String, List<OrderItem>> itemsByOrder = itemRepository
        .findByOrderIdIn(orderIds)
        .stream()
        .collect(Collectors.groupingBy(OrderItem::getOrderId));

    // Assemble results in memory (fast)
    return orders.stream()
        .map(order -> new OrderDTO(
            order,
            users.get(order.getUserId()),
            itemsByOrder.getOrDefault(order.getId(), List.of())
        ))
        .collect(Collectors.toList());
}

This change alone reduced our database time from 180ms to 70ms. But we weren’t done.

Phase 2: Query Optimization and Indexing

Even with batch loading, some queries were slow. We used PostgreSQL’s EXPLAIN ANALYZE to understand why:

EXPLAIN ANALYZE
SELECT o.*, u.name
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.created_at > '2022-01-01'
  AND o.status = 'pending'
ORDER BY o.created_at DESC
LIMIT 100;

The query plan showed a sequential scan—the database was reading every row in the orders table to find pending orders. With millions of orders, this took 145ms.

The solution was a targeted index:

CREATE INDEX idx_orders_status_created_at
ON orders(status, created_at DESC)
WHERE status = 'pending';

This partial index only includes pending orders and sorts them by creation date—exactly what our query needs. After adding the index, the same query took 12ms. A 12x improvement from one line of SQL.

Lesson learned: Most ORMs hide the actual queries from you. Take the time to look at what’s actually hitting your database.

Phase 3: Caching Layers

Our application was making external API calls for user data, product information, and pricing. Each call added 60-120ms of latency.

The data didn’t change frequently—user profiles are relatively stable, product catalogs update hourly, prices change occasionally. We were making the same calls repeatedly.

We implemented a multi-tier caching strategy:

  • L1 Cache (In-Memory): Ultra-fast local cache with LRU eviction, 1-minute TTL
  • L2 Cache (Redis): Shared cache across all instances, 15-minute TTL
  • L3 (Source): Original data source

# TTLCache comes from the cachetools library; RedisClient and ExternalAPI are our
# own async wrappers around Redis and the upstream service.
from cachetools import TTLCache

class MultiLevelCache:
    def __init__(self):
        self.l1 = TTLCache(maxsize=10000, ttl=60)  # in-process L1, 1-minute TTL
        self.l2 = RedisClient()                    # shared Redis L2, 15-minute TTL
        self.source = ExternalAPI()                # the slow external call

    async def get(self, key):
        # Try L1 first (microseconds)
        if key in self.l1:
            return self.l1[key]

        # Try L2 (milliseconds)
        value = await self.l2.get(key)
        if value:
            self.l1[key] = value  # Promote to L1
            return value

        # Hit the source (100+ milliseconds)
        value = await self.source.fetch(key)

        # Populate both cache levels
        await self.l2.setex(key, 900, value)  # 15 min
        self.l1[key] = value  # 1 min

        return value

With a cache hit rate of 85%, we reduced external API time from 120ms to 40ms average.

The tricky part was cache invalidation. We used event-driven invalidation—when data changes at the source, it publishes an event, and we invalidate the relevant cache entries. It’s not perfect (there’s a brief inconsistency window), but the trade-off was acceptable for our use case.
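
To make that concrete, here is a minimal sketch of the listener shape, written in Java since most of our examples are (the cache class above happens to be Python, but the idea is identical). OrderChangedEvent, the key format, and the injected cache handles are illustrative assumptions rather than code from our repo: a change event simply evicts the affected key from both tiers, and the next read repopulates them.

@Component
public class CacheInvalidationListener {
    // Illustrative handles: an in-process L1 (e.g. Caffeine) and Redis as the shared L2
    private final Cache<String, OrderDTO> localCache;
    private final StringRedisTemplate redis;

    public CacheInvalidationListener(Cache<String, OrderDTO> localCache,
                                     StringRedisTemplate redis) {
        this.localCache = localCache;
        this.redis = redis;
    }

    @EventListener
    public void onOrderChanged(OrderChangedEvent event) {  // hypothetical event type
        String key = "order:" + event.getOrderId();

        // Drop the stale entry from both tiers; the next read repopulates them lazily
        localCache.invalidate(key);  // Caffeine's Cache#invalidate
        redis.delete(key);           // DEL via Spring's StringRedisTemplate
    }
}

Deleting rather than rewriting the entry keeps the listener simple and idempotent; the brief inconsistency window is the same one noted above.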

Phase 4: Connection Pooling

Every time we talked to the database, we were establishing a new connection—a process that takes 20-50ms. We were essentially throwing away that connection after each request.

Database connection pools reuse connections across requests. We configured HikariCP, a high-performance connection pool:

@Configuration
public class DataSourceConfig {
    @Bean
    public DataSource dataSource() {
        HikariConfig config = new HikariConfig();
        // JDBC URL and credentials come from application properties (omitted here)

        // Start from the core count, then tune the pool size under real load
        int cores = Runtime.getRuntime().availableProcessors();
        config.setMaximumPoolSize(cores * 2 + 1);
        config.setMinimumIdle(5);

        // Connection timeouts
        config.setConnectionTimeout(5000);
        config.setIdleTimeout(600000);

        // Critical: prepared statement caching
        config.addDataSourceProperty("cachePrepStmts", "true");
        config.addDataSourceProperty("prepStmtCacheSize", "250");

        return new HikariDataSource(config);
    }
}

Connection pooling removed this overhead from the request path. Connections are established once and reused, so the 25ms setup cost was amortized across many requests.

Phase 5: Parallelizing Independent Operations

Looking at our code, we noticed sequential operations that didn’t depend on each other:

// Sequential execution: 180ms total
Data data = fetchData(id);              // 60ms
Metadata metadata = fetchMetadata(id);  // 60ms
UserInfo user = fetchUserInfo(id);      // 60ms

These calls were independent—we didn’t need the result of one to make the next call. Why wait?

// Parallel execution: 60ms total (max of all three)
CompletableFuture<Data> dataFuture =
    CompletableFuture.supplyAsync(() -> fetchData(id));

CompletableFuture<Metadata> metadataFuture =
    CompletableFuture.supplyAsync(() -> fetchMetadata(id));

CompletableFuture<UserInfo> userFuture =
    CompletableFuture.supplyAsync(() -> fetchUserInfo(id));

// Wait for all to complete
CompletableFuture.allOf(dataFuture, metadataFuture, userFuture).join();

// Combine results
EnrichedData result = new EnrichedData(
    dataFuture.join(),
    metadataFuture.join(),
    userFuture.join()
);

Parallelizing independent I/O operations is one of the easiest performance wins: the threads spend most of that time blocked waiting on network responses, so running the calls concurrently adds almost no CPU cost.
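
One caveat worth flagging, and an assumption on our part since the snippet above doesn't show an executor: supplyAsync uses the common ForkJoinPool by default, which is sized for CPU-bound work. For blocking network calls it is safer to hand the futures a dedicated pool. A sketch, reusing the same fetch methods (the pool size here is purely illustrative):

// Same fan-out, but on a dedicated pool for blocking I/O
ExecutorService ioPool = Executors.newFixedThreadPool(16);  // size to expected concurrency, not core count

CompletableFuture<Data> dataFuture =
    CompletableFuture.supplyAsync(() -> fetchData(id), ioPool);
CompletableFuture<Metadata> metadataFuture =
    CompletableFuture.supplyAsync(() -> fetchMetadata(id), ioPool);
CompletableFuture<UserInfo> userFuture =
    CompletableFuture.supplyAsync(() -> fetchUserInfo(id), ioPool);

CompletableFuture.allOf(dataFuture, metadataFuture, userFuture).join();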

The Results

After systematic optimization:

Component         Before    After    Improvement
Database          180ms     35ms     5.1x
External APIs     120ms     40ms     3x
Serialization     30ms      8ms      3.75x
Connections       25ms      ~0ms     Eliminated
Application       45ms      17ms     2.6x
Total p95         400ms     50ms     8x

(After Phase 5, independent operations run in parallel, so the per-component times overlap and no longer sum to the end-to-end p95.)

Key Lessons

Measure First: We spent a week instrumenting before writing optimization code. That investment paid off 10x.

Profile Everything: CPU profilers showed us that JSON serialization was expensive. We switched to Protocol Buffers for internal services.

The 80/20 Rule: Fixing the top 3 issues (N+1 queries, missing indexes, no caching) gave us 70% of the improvement.

Performance Degrades: We set up alerts for p95 latency >100ms. Six months later, they caught regressions we wouldn’t have noticed otherwise.
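
For reference, a sketch of the wiring that makes such an alert possible (assumed setup, reusing the MeterRegistry from the profiler earlier; the meter name and tag are placeholders): Micrometer can publish a client-side p95 for a timer, and the 100ms threshold would then live in the monitoring system rather than in application code.

// Sketch: publish a p95 gauge and percentile histogram for a request timer
Timer ordersTimer = Timer.builder("orders.latency")   // placeholder meter name
    .tag("operation", "getOrders")                    // placeholder tag
    .publishPercentiles(0.95)                         // client-side p95 gauge
    .publishPercentileHistogram()                     // buckets for server-side quantiles
    .register(metrics);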

Test Under Load: Some optimizations that looked good in development actually made things worse under production load. Load testing caught these.

What Didn’t Work

Not every optimization succeeded:

  • Database sharding: Complexity wasn’t worth the gains for our scale
  • GraphQL: Added overhead without benefits for our use case
  • Aggressive caching: Cache invalidation bugs caused data inconsistencies

Sometimes the best optimization is not optimizing. Focus on what matters.

Continuous Monitoring

Performance optimization isn’t a one-time project. We now track latency breakdown in production dashboards and get alerted when any component regresses.

The journey from 400ms to 50ms taught us that performance is a feature, not an afterthought. Users notice the difference, and the systematic approach to finding and fixing bottlenecks works at any scale.