Event-driven architectures have become the backbone of modern distributed systems, enabling applications to process billions of events daily while maintaining loose coupling and high availability. In this post, I’ll share insights from building systems that process over 100 million events per day, covering the fundamental patterns and practical considerations that make or break these architectures at scale.
Why Event-Driven Architecture?
Traditional request-response architectures face significant challenges when dealing with high-volume, real-time data processing. Event-driven systems solve these problems by:
- Decoupling producers and consumers: Services don’t need to know about each other
- Enabling temporal decoupling: Producers and consumers don’t need to be online simultaneously
- Supporting multiple consumers: One event can trigger multiple downstream actions
- Facilitating system evolution: New consumers can be added without changing producers
Core Patterns for Event-Driven Systems
1. Event Sourcing vs Event Notification
Understanding the difference is critical. Event notification is lightweight:
public class UserLoginEvent {
    private String userId;
    private Instant timestamp;
    // Minimal data - consumers fetch details if needed
}
Event sourcing captures complete state changes:
public class OrderPlacedEvent {
    private String orderId;
    private List<OrderItem> items;
    private Address shippingAddress;
    private PaymentDetails payment;
    private Instant timestamp;
    // Complete information to reconstruct state
}
For high-scale systems processing millions of events, I’ve found that event notification with metadata works best for most use cases, reserving event sourcing for critical business events where complete audit trails are required.
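To make that concrete, here is a minimal sketch of a notification-style event that carries just enough metadata to route and deduplicate (the field names are illustrative, not a fixed schema):

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class NotificationEvent:
    """Lean notification: enough metadata to route and deduplicate,
    plus a reference for consumers that need the full record."""
    event_id: str
    event_type: str  # e.g. "user.login"
    entity_id: str   # reference used to fetch full state if needed
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

Consumers that need the full record fetch it by entity_id, which keeps the event small at the cost of an extra read on the consumer side.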
2. Topic Design and Partitioning Strategy
Kafka’s partitioning strategy directly impacts your system’s scalability. Here’s a production-tested approach:
import random
import zlib


class EventPartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, event):
        # Partition by tenant/customer ID for ordering guarantees
        # within a customer while maintaining parallelism across customers.
        # Use a stable hash (crc32) rather than Python's built-in hash(),
        # which is salted per process and would assign the same tenant
        # to different partitions across producer restarts.
        if event.tenant_id:
            return zlib.crc32(event.tenant_id.encode("utf-8")) % self.num_partitions
        # Fall back to random partitioning for non-tenant events
        return random.randint(0, self.num_partitions - 1)
Key considerations:
- Hot partitions: Monitor partition sizes and rebalance if needed (see the sketch after this list)
- Ordering guarantees: Messages with the same key go to the same partition
- Consumer group sizing: Number of consumers ≤ number of partitions
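For the hot-partition check, a minimal sketch that flags partitions whose traffic exceeds the mean by some skew factor; the per-partition counts are assumed to come from your metrics pipeline:

def find_hot_partitions(partition_counts, skew_threshold=2.0):
    """Flag partitions whose message count exceeds the mean by
    skew_threshold x. Counts come from your metrics pipeline."""
    mean = sum(partition_counts.values()) / len(partition_counts)
    return {
        p: count for p, count in partition_counts.items()
        if count > skew_threshold * mean
    }

# Example: partition 2 receives far more traffic than its peers
hot = find_hot_partitions({0: 1000, 1: 1200, 2: 9000, 3: 900})

In practice, "rebalancing" usually means choosing a better partition key or salting the hot key, rather than moving data around.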
3. Handling Backpressure
When consumers can’t keep up with producers, you need backpressure mechanisms:
type EventProcessor struct {
    inputChan  chan Event
    bufferSize int
    semaphore  chan struct{}
}

func (p *EventProcessor) ProcessWithBackpressure(ctx context.Context) {
    // Limit concurrent processing
    p.semaphore = make(chan struct{}, p.bufferSize)
    for {
        select {
        case event := <-p.inputChan:
            // Block if too many events are in flight
            p.semaphore <- struct{}{}
            go func(e Event) {
                defer func() { <-p.semaphore }()
                p.process(e)
            }(event)
        case <-ctx.Done():
            return
        }
    }
}
Production Lessons Learned
Schema Evolution
Never underestimate schema evolution complexity. Use schema registries and follow these rules:
- Always add fields as optional: New fields should have defaults
- Never remove required fields: Deprecate and add new versions instead
- Use schema versioning: Embed version numbers in events
{
  "schemaVersion": "v2",
  "eventType": "user.activity.recorded",
  "timestamp": "2021-01-15T10:30:00Z",
  "payload": {
    "userId": "user-123",
    "activityType": "login",
    "metadata": {
      "ipAddress": "192.168.1.1",
      "userAgent": "Mozilla/5.0..."
    }
  }
}
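On the consumer side, the embedded version lets you route payloads to the right decoder. A minimal sketch, assuming the v1/v2 layouts above (the handler name is hypothetical):

def decode_user_activity(event: dict) -> dict:
    """Dispatch on the embedded schema version; missing optional
    fields get defaults so older events keep working."""
    version = event.get("schemaVersion", "v1")
    payload = event["payload"]
    if version == "v1":
        # v1 had no metadata block; supply an empty default
        payload.setdefault("metadata", {})
        return payload
    if version == "v2":
        return payload
    raise ValueError(f"Unsupported schema version: {version}")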
Idempotency is Non-Negotiable
At scale, duplicate events are inevitable. Design for idempotency from day one:
public class IdempotentEventHandler {
    private final EventStore eventStore;

    public IdempotentEventHandler(EventStore eventStore) {
        this.eventStore = eventStore;
    }

    public void handle(Event event) {
        // Check if already processed; under concurrent delivery this
        // check and markProcessed should be atomic (e.g. a unique-key insert)
        if (eventStore.exists(event.getId())) {
            return; // Skip duplicate
        }
        // If processEvent throws, the event is never marked as processed,
        // so a redelivery will retry it
        processEvent(event);
        eventStore.markProcessed(event.getId());
    }
}
Dead Letter Queues and Error Handling
Implement comprehensive error handling with DLQs:
class EventConsumer:
    def __init__(self, kafka_consumer, dlq_producer):
        self.consumer = kafka_consumer
        self.dlq = dlq_producer
        self.max_retries = 3

    def process_with_dlq(self, event):
        retry_count = event.metadata.get('retry_count', 0)
        try:
            self.process(event)
        except RetriableError as e:
            if retry_count < self.max_retries:
                event.metadata['retry_count'] = retry_count + 1
                self.consumer.retry(event)
            else:
                self.dlq.send(event, error=str(e))
        except NonRetriableError as e:
            self.dlq.send(event, error=str(e))
Observability for Event-Driven Systems
Monitoring distributed event flows requires specialized approaches:
End-to-End Tracing
Implement correlation IDs to track events through the system:
type Event struct {
    ID            string
    Type          string
    CorrelationID string
    CausationID   string
    Timestamp     time.Time
    Payload       interface{}
}

func (e *Event) CreateChild(eventType string) Event {
    return Event{
        ID:            generateID(),
        Type:          eventType,
        CorrelationID: e.CorrelationID, // Same correlation as the parent
        CausationID:   e.ID,            // Parent event ID
        Timestamp:     time.Now(),
    }
}
Key Metrics to Monitor
- Producer metrics: Message production rate, failure rate, serialization time
- Consumer metrics: Lag (critical!), processing time, error rates
- Infrastructure metrics: Partition distribution, replication lag, disk usage
metrics = {
    'consumer_lag': consumer.lag(),  # Messages behind
    'processing_time_p99': percentile(processing_times, 99),
    'error_rate': errors / total_processed,
    'throughput': messages_per_second,
}
Scaling Considerations
When to Add Partitions
Adding partitions is a one-way operation in Kafka: you can increase the count but never decrease it, and increasing it changes which partition a given key hashes to, breaking key-based ordering across the resize. Before resizing, consider the factors below and the sizing sketch that follows:
- Current max throughput per partition: ~10 MB/s
- Consumer processing capability
- Ordering requirements: ordering is guaranteed only within a partition, so more partitions means weaker end-to-end ordering
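As a starting point, a back-of-the-envelope sizing calculation based on the ~10 MB/s per-partition rule of thumb above (a minimal sketch; the headroom factor is an assumption to tune for your workload):

import math


def estimate_partitions(peak_mb_per_sec, per_partition_mb_per_sec=10,
                        headroom=2.0):
    """Size for peak throughput with headroom, since partitions can be
    added later but never removed (and adding them remaps keys)."""
    return math.ceil(peak_mb_per_sec * headroom / per_partition_mb_per_sec)


# e.g. 250 MB/s peak with 2x headroom -> 50 partitions
print(estimate_partitions(250))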
Consumer Group Management
Properties props = new Properties();
props.put("group.id", "event-processor-group");
props.put("max.poll.records", 500); // Tune based on per-record processing time
props.put("session.timeout.ms", 30000);
props.put("heartbeat.interval.ms", 3000);
// Commit offsets manually after processing; enable auto-commit
// only if your processing is idempotent
props.put("enable.auto.commit", "false");
Anti-Patterns to Avoid
- Event chains that are too long: More than 3-4 hops makes debugging nightmarish
- Putting too much data in events: Keep events focused and lean
- Synchronous event processing: Defeats the purpose of async architecture
- No retry strategy: Networks fail, plan for it
- Ignoring consumer lag: Lag is your canary in the coal mine
Conclusion
Event-driven architectures enable incredible scale, but they require careful design and operational discipline. Focus on idempotency, observability, and proper error handling from the start. The patterns shared here have proven effective in systems processing hundreds of millions of events daily, but remember: every system has unique requirements. Use these as starting points and adapt based on your specific needs.
The key to success is starting simple, measuring everything, and scaling incrementally as you learn your system’s behavior under load.