Latency is a feature. In a world where users expect instant responses, every millisecond matters. This post chronicles our journey optimizing a high-traffic API platform from 400ms to 50ms p95 latency—an 8x improvement. The techniques shared here apply to any high-scale distributed system.
Understanding the Baseline
Before optimizing, we needed to understand where time was being spent. Our initial architecture looked like this:
User Request → Load Balancer → API Gateway → Service → Database
                   ~10ms          ~20ms      ~350ms     ~20ms
The 350ms in the service layer was our primary target.
Measuring Everything
You can’t improve what you can’t measure. We instrumented the entire request path:
import java.util.function.Supplier;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class LatencyMonitor {
    private static final Logger logger = LoggerFactory.getLogger(LatencyMonitor.class);

    private final MeterRegistry metrics;

    public LatencyMonitor(MeterRegistry metrics) {
        this.metrics = metrics;
    }

    public <T> T measureLatency(String operation, Supplier<T> supplier) {
        Timer.Sample sample = Timer.start(metrics);
        long startNanos = System.nanoTime();
        try {
            T result = supplier.get();
            long durationNanos = System.nanoTime() - startNanos;
            // Record to metrics
            sample.stop(metrics.timer("latency",
                "operation", operation,
                "status", "success"
            ));
            // Log slow operations
            if (durationNanos > 100_000_000) { // 100ms
                logger.warn("Slow operation: {} took {}ms",
                    operation,
                    durationNanos / 1_000_000
                );
            }
            return result;
        } catch (Exception e) {
            sample.stop(metrics.timer("latency",
                "operation", operation,
                "status", "error"
            ));
            throw e;
        }
    }
}
// Usage
@Service
public class EventService {
    private final LatencyMonitor latencyMonitor;
    private final Database database;
    private final Enricher enricher;

    public EventService(LatencyMonitor latencyMonitor, Database database, Enricher enricher) {
        this.latencyMonitor = latencyMonitor;
        this.database = database;
        this.enricher = enricher;
    }

    public Event getEvent(String eventId) {
        return latencyMonitor.measureLatency("getEvent", () -> {
            Event event = latencyMonitor.measureLatency("db.query",
                () -> database.query(eventId)
            );
            latencyMonitor.measureLatency("enrichment",
                () -> enricher.enrich(event)
            );
            return event;
        });
    }
}
This revealed our top latency contributors:
- Database queries: 180ms
- External service calls: 120ms
- Data serialization: 30ms
- Business logic: 20ms
Optimization 1: Database Queries
Problem: N+1 Query Pattern
// BEFORE: N+1 queries (slow!)
@Service
public class EventService {
    public List<EnrichedEvent> getEvents(List<String> eventIds) {
        List<EnrichedEvent> results = new ArrayList<>();
        for (String eventId : eventIds) {
            Event event = eventRepository.findById(eventId);        // 1 query per event
            User user = userRepository.findById(event.getUserId()); // +1 user query per event!
            results.add(new EnrichedEvent(event, user));
        }
        return results;
    }
}
Solution: Batch Loading
// AFTER: 2 queries total
@Service
public class OptimizedEventService {
    public List<EnrichedEvent> getEvents(List<String> eventIds) {
        // Single query for all events
        List<Event> events = eventRepository.findAllById(eventIds);

        // Single query for all users
        Set<String> userIds = events.stream()
            .map(Event::getUserId)
            .collect(Collectors.toSet());
        Map<String, User> users = userRepository.findAllById(userIds)
            .stream()
            .collect(Collectors.toMap(User::getId, Function.identity()));

        // Combine in memory
        return events.stream()
            .map(event -> new EnrichedEvent(event, users.get(event.getUserId())))
            .collect(Collectors.toList());
    }
}
Result: Database query time reduced from 180ms → 40ms
Adding Strategic Indexes
-- BEFORE: Full table scan
SELECT * FROM events
WHERE user_id = ? AND created_at > ?
ORDER BY created_at DESC
LIMIT 100;
-- Add composite index
CREATE INDEX idx_events_user_created
ON events(user_id, created_at DESC);
-- AFTER: Index scan
EXPLAIN SELECT * FROM events
WHERE user_id = ? AND created_at > ?
ORDER BY created_at DESC
LIMIT 100;
-- Query plan now shows:
-- Index Scan using idx_events_user_created on events (cost=0.43..123.45 rows=100)
Optimization 2: Caching Strategy
Distributed Caching with Redis
@Service
public class CachedEventService {
    private final RedisTemplate<String, Event> redis;
    private final EventRepository repository;
    private final MeterRegistry metrics;
    private final Duration cacheTTL = Duration.ofMinutes(15);

    public CachedEventService(RedisTemplate<String, Event> redis,
                              EventRepository repository,
                              MeterRegistry metrics) {
        this.redis = redis;
        this.repository = repository;
        this.metrics = metrics;
    }

    public Event getEvent(String eventId) {
        // Try cache first
        String cacheKey = "event:" + eventId;
        Event cached = redis.opsForValue().get(cacheKey);
        if (cached != null) {
            metrics.counter("cache.hit", "entity", "event").increment();
            return cached;
        }
        metrics.counter("cache.miss", "entity", "event").increment();

        // Load from database
        Event event = repository.findById(eventId)
            .orElseThrow(() -> new EventNotFoundException(eventId));

        // Populate cache
        redis.opsForValue().set(cacheKey, event, cacheTTL);
        return event;
    }

    public void updateEvent(Event event) {
        // Update database
        repository.save(event);
        // Invalidate cache
        redis.delete("event:" + event.getId());
        // Also invalidate any derived caches
        redis.delete("user:events:" + event.getUserId());
    }
}
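The RedisTemplate<String, Event> above needs its serializers configured; here is a minimal wiring sketch under the assumption of JSON values (the bean and class names are illustrative, not from the original system):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.serializer.Jackson2JsonRedisSerializer;
import org.springframework.data.redis.serializer.StringRedisSerializer;

@Configuration
public class RedisConfig {
    @Bean
    public RedisTemplate<String, Event> eventRedisTemplate(RedisConnectionFactory factory) {
        RedisTemplate<String, Event> template = new RedisTemplate<>();
        template.setConnectionFactory(factory);
        // Keys as plain strings, values as JSON
        template.setKeySerializer(new StringRedisSerializer());
        template.setValueSerializer(new Jackson2JsonRedisSerializer<>(Event.class));
        return template;
    }
}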
Local Caching for Hot Data
import (
	"sync"
	"time"

	lru "github.com/hashicorp/golang-lru"
)

type LocalCache struct {
	cache *lru.Cache
	ttl   time.Duration
	mu    sync.Mutex // full lock: the LRU mutates recency order even on reads
}

type cacheEntry struct {
	value      interface{}
	expiration time.Time
}

func NewLocalCache(size int, ttl time.Duration) (*LocalCache, error) {
	cache, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &LocalCache{cache: cache, ttl: ttl}, nil
}

func (c *LocalCache) Get(key string) (interface{}, bool) {
	// Lock (not RLock) so the expiry check and removal are atomic
	c.mu.Lock()
	defer c.mu.Unlock()
	if entry, ok := c.cache.Get(key); ok {
		e := entry.(*cacheEntry)
		if time.Now().Before(e.expiration) {
			return e.value, true
		}
		// Expired: evict lazily
		c.cache.Remove(key)
	}
	return nil, false
}

func (c *LocalCache) Set(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache.Add(key, &cacheEntry{
		value:      value,
		expiration: time.Now().Add(c.ttl),
	})
}
// Usage: two-tier caching
func (s *Service) GetUser(userID string) (*User, error) {
	// L1: Local cache (microseconds)
	if user, ok := s.localCache.Get(userID); ok {
		return user.(*User), nil
	}
	// L2: Redis cache (milliseconds)
	if user, err := s.redisCache.Get(userID); err == nil {
		s.localCache.Set(userID, user)
		return user, nil
	}
	// L3: Database (tens of milliseconds)
	user, err := s.db.GetUser(userID)
	if err != nil {
		return nil, err
	}
	s.redisCache.Set(userID, user)
	s.localCache.Set(userID, user)
	return user, nil
}
Result: Cache hit rate of 85%, reducing average data access from 40ms → 5ms
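The hit rate itself falls straight out of the counters recorded in CachedEventService. A minimal sketch of the computation (where to expose it is up to you):

// Hit rate derived from the cache.hit / cache.miss counters recorded above
double hits = metrics.counter("cache.hit", "entity", "event").count();
double misses = metrics.counter("cache.miss", "entity", "event").count();
double hitRate = (hits + misses) == 0 ? 0.0 : hits / (hits + misses); // ~0.85 in our case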
Optimization 3: Parallel Processing
Before: Sequential Processing
def process_event(event_id):
    # Sequential - takes the sum of all operations
    event = db.get_event(event_id)                               # 20ms
    user = db.get_user(event.user_id)                            # 20ms
    metadata = external_service.get_metadata(event.metadata_id)  # 60ms
    enriched = enrich(event, user, metadata)                     # 10ms
    return enriched
# Total: 110ms
After: Parallel Processing
import asyncio

async def process_event_parallel(event_id):
    # Parallel - takes the max of the concurrent operations
    event = await async_db.get_event(event_id)

    # Fetch user and metadata concurrently once we have the event
    user_task = asyncio.create_task(async_db.get_user(event.user_id))
    metadata_task = asyncio.create_task(
        async_external_service.get_metadata(event.metadata_id)
    )

    # Wait for both
    user, metadata = await asyncio.gather(user_task, metadata_task)

    # CPU-bound work
    enriched = enrich(event, user, metadata)
    return enriched
# Total: 20ms + max(20ms, 60ms) + 10ms = 90ms
For Java, use CompletableFuture:
public CompletableFuture<EnrichedEvent> processEventParallel(String eventId) {
    return eventRepository.findByIdAsync(eventId)
        .thenCompose(event -> {
            // Launch parallel operations
            CompletableFuture<User> userFuture =
                userRepository.findByIdAsync(event.getUserId());
            CompletableFuture<Metadata> metadataFuture =
                externalService.getMetadataAsync(event.getMetadataId());

            // Combine results
            return CompletableFuture.allOf(userFuture, metadataFuture)
                .thenApply(v -> enricher.enrich(
                    event,
                    userFuture.join(),
                    metadataFuture.join()
                ));
        });
}
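At the service boundary the future composes without blocking. For instance, Spring MVC can return it directly from a controller (a sketch; the route and eventService field are illustrative):

// The servlet thread is released while the async pipeline runs
@GetMapping("/events/{id}/enriched")
public CompletableFuture<EnrichedEvent> getEnrichedEvent(@PathVariable String id) {
    return eventService.processEventParallel(id);
}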
Result: External service integration time reduced from 120ms → 60ms
Optimization 4: Faster Serialization
Problem: Jackson Default Serialization
// BEFORE: Reflection-based serialization
@Data
public class Event {
    private String id;
    private String userId;
    private String type;
    private Map<String, Object> attributes;
    private Instant timestamp;
}

ObjectMapper mapper = new ObjectMapper();
String json = mapper.writeValueAsString(event); // 15ms for complex objects
Solution: Optimized Serialization
// Use the Afterburner module for bytecode-generated accessors
ObjectMapper mapper = new ObjectMapper()
    .registerModule(new AfterburnerModule())
    // Don't serialize nulls
    .setSerializationInclusion(JsonInclude.Include.NON_NULL)
    // Write dates as ISO-8601 strings rather than numeric timestamps
    .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
// For ultra-hot paths, use manual serialization
public class FastEventSerializer {
    private static final byte[] ID_PREFIX = "\"id\":\"".getBytes(StandardCharsets.UTF_8);
    private static final byte[] USER_ID_PREFIX = "\",\"userId\":\"".getBytes(StandardCharsets.UTF_8);
    // ... other fields

    public byte[] serialize(Event event) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(512);
        try {
            baos.write('{');
            baos.write(ID_PREFIX);
            baos.write(event.getId().getBytes(StandardCharsets.UTF_8));
            baos.write(USER_ID_PREFIX);
            baos.write(event.getUserId().getBytes(StandardCharsets.UTF_8));
            // ... serialize other fields (assumes values need no JSON escaping)
            baos.write('"'); // close the last string value
            baos.write('}');
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return baos.toByteArray();
    }
}
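Timing claims like these deserve a proper microbenchmark rather than System.nanoTime loops. A minimal JMH sketch for verifying them (assuming JMH is on the classpath; the fixture values are illustrative):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class SerializerBenchmark {
    private FastEventSerializer serializer;
    private Event event;

    @Setup
    public void setup() {
        serializer = new FastEventSerializer();
        event = new Event();
        event.setId("evt-123");       // setters generated by @Data
        event.setUserId("user-456");
    }

    @Benchmark
    public byte[] manualSerialization() {
        return serializer.serialize(event);
    }
}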
For even better performance, use protobuf:
syntax = "proto3";

message Event {
  string id = 1;
  string user_id = 2;
  string type = 3;
  map<string, string> attributes = 4;
  int64 timestamp = 5;
}

// Protobuf serialization is 3-5x faster
Event event = Event.newBuilder()
    .setId(id)
    .setUserId(userId)
    .setType(type)
    .setTimestamp(System.currentTimeMillis())
    .build();
byte[] bytes = event.toByteArray(); // 3ms
Result: Serialization time reduced from 30ms → 8ms
Optimization 5: Connection Pooling
// BEFORE: Creating a new connection per request
public Event getEvent(String id) {
    try (Connection conn = DriverManager.getConnection(url, user, password)) {
        // Run the query - establishing the connection alone adds 20-50ms!
    }
}

// AFTER: Connection pooling with HikariCP
@Configuration
public class DataSourceConfig {
    @Bean
    public DataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setUsername(username);
        config.setPassword(password);

        // Tune pool size based on load
        config.setMaximumPoolSize(50);
        config.setMinimumIdle(10);

        // Timeouts (milliseconds)
        config.setConnectionTimeout(5000);  // max wait for a pooled connection
        config.setIdleTimeout(600000);      // 10 minutes
        config.setMaxLifetime(1800000);     // 30 minutes

        // Statement caching (MySQL Connector/J properties)
        config.addDataSourceProperty("cachePrepStmts", "true");
        config.addDataSourceProperty("prepStmtCacheSize", "250");
        config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");

        return new HikariDataSource(config);
    }
}
Result: Database connection overhead reduced from ~25ms → ~0ms (amortized)
Optimization 6: Reducing Payload Size
// BEFORE: Sending everything
@GetMapping("/events/{id}")
public Event getEvent(@PathVariable String id) {
    Event event = eventService.get(id);
    return event; // 50KB response
}

// AFTER: Field filtering
@GetMapping("/events/{id}")
public EventDTO getEvent(
        @PathVariable String id,
        @RequestParam(required = false) Set<String> fields) {
    Event event = eventService.get(id);
    if (fields == null || fields.isEmpty()) {
        return eventMapper.toDTO(event);
    }
    // Return only the requested fields
    return eventMapper.toDTO(event, fields); // 5KB response
}

// Client specifies fields:
// GET /events/123?fields=id,type,timestamp
Smaller payloads = faster network transfer and serialization.
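The eventMapper.toDTO(event, fields) call above is where the filtering happens. One way to implement it is Jackson's @JsonFilter; a sketch under that assumption (the filter id, class names, and DTO fields are illustrative):

import java.util.Set;
import com.fasterxml.jackson.annotation.JsonFilter;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.ser.FilterProvider;
import com.fasterxml.jackson.databind.ser.impl.SimpleBeanPropertyFilter;
import com.fasterxml.jackson.databind.ser.impl.SimpleFilterProvider;

// The DTO opts into filtering via a named Jackson filter
@JsonFilter("eventFields")
class EventDTO {
    public String id;
    public String type;
    public long timestamp;
}

class EventFieldFilter {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    static String serialize(EventDTO dto, Set<String> fields) throws Exception {
        // No fields requested -> serialize everything; otherwise keep only those named
        SimpleBeanPropertyFilter filter = (fields == null || fields.isEmpty())
            ? SimpleBeanPropertyFilter.serializeAll()
            : SimpleBeanPropertyFilter.filterOutAllExcept(fields);
        FilterProvider filters = new SimpleFilterProvider().addFilter("eventFields", filter);
        return MAPPER.writer(filters).writeValueAsString(dto);
    }
}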
Optimization 7: Algorithm Improvements
Sometimes the biggest wins come from algorithmic changes:
from collections import defaultdict

# BEFORE: an expensive is_related() check against every event
# (O(n^2) comparisons when computed across all events)
def find_related_events(event_id, all_events):
    related = []
    target = get_event(event_id)
    for event in all_events:
        if is_related(target, event):  # Expensive comparison
            related.append(event)
    return related
# For 10,000 events: ~5 seconds

# AFTER: cheap candidate lookups against prebuilt indexes
class EventIndex:
    def __init__(self):
        self.by_user = defaultdict(list)
        self.by_type = defaultdict(list)
        self.by_tag = defaultdict(set)

    def index_event(self, event):
        self.by_user[event.user_id].append(event.id)
        self.by_type[event.type].append(event.id)
        for tag in event.tags:
            self.by_tag[tag].add(event.id)

    def find_related(self, event_id):
        event = get_event(event_id)
        # Use the indexes to find candidates; a final is_related()
        # pass over this much smaller set can refine the result
        candidates = set()
        candidates.update(self.by_user[event.user_id])
        candidates.update(self.by_type[event.type])
        for tag in event.tags:
            candidates.update(self.by_tag[tag])
        candidates.discard(event_id)
        return list(candidates)
# For 10,000 events: ~50ms
Monitoring the Improvements
Track latency improvements with percentile metrics:
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.ValueAtPercentile;
import org.springframework.stereotype.Component;

@Component
public class LatencyTracker {
    private final MeterRegistry registry;

    public LatencyTracker(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordLatency(String endpoint, long durationMs) {
        // publishPercentiles makes p50/p95/p99 available to the backend and to takeSnapshot()
        Timer.builder("http.request.duration")
            .tag("endpoint", endpoint)
            .publishPercentiles(0.50, 0.95, 0.99)
            .register(registry)
            .record(durationMs, TimeUnit.MILLISECONDS);
    }

    // Query percentiles from the histogram snapshot
    public LatencyStats getStats(String endpoint) {
        Timer timer = registry.find("http.request.duration")
            .tag("endpoint", endpoint)
            .timer();
        Map<Double, Double> percentiles = new HashMap<>();
        for (ValueAtPercentile vap : timer.takeSnapshot().percentileValues()) {
            percentiles.put(vap.percentile(), vap.value(TimeUnit.MILLISECONDS));
        }
        return new LatencyStats(
            timer.mean(TimeUnit.MILLISECONDS),
            percentiles.get(0.50), // p50 (median)
            percentiles.get(0.95), // p95
            percentiles.get(0.99), // p99
            timer.max(TimeUnit.MILLISECONDS)
        );
    }
}
Results Summary
| Optimization | Before | After | Improvement |
|---|---|---|---|
| Database queries | 180ms | 40ms | 4.5x |
| Caching layer (data access) | 40ms | 5ms | 8x |
| External services | 120ms | 60ms | 2x |
| Serialization | 30ms | 8ms | 3.75x |
| Connection pooling | 25ms | ~0ms | ∞ |
| Total p95 | 400ms | 50ms | 8x |
Key Takeaways
- Measure first: Instrument everything before optimizing
- Fix the biggest problems: Focus on operations taking >10% of total time
- Database is often the bottleneck: Optimize queries, add indexes, batch operations
- Caching is powerful: But manage invalidation carefully
- Parallelize I/O: Don’t wait for sequential operations
- Reduce data transfer: Smaller payloads are faster
- Monitor continuously: Latency degrades over time without vigilance
Latency optimization is an ongoing journey, not a one-time project. Build monitoring, measurement, and optimization into your development culture.