Event-driven architectures have become the backbone of modern distributed systems, enabling applications to process billions of events daily while maintaining loose coupling and high availability. In this post, I’ll share insights from building systems that process over 100 million events per day, covering the fundamental patterns and practical considerations that make or break these architectures at scale.

Why Event-Driven Architecture?

Traditional request-response architectures face significant challenges when dealing with high-volume, real-time data processing. Event-driven systems solve these problems by:

  • Decoupling producers and consumers: Services don’t need to know about each other
  • Enabling temporal decoupling: Producers and consumers don’t need to be online simultaneously
  • Supporting multiple consumers: One event can trigger multiple downstream actions
  • Facilitating system evolution: New consumers can be added without changing producers

Core Patterns for Event-Driven Systems

1. Event Sourcing vs Event Notification

Understanding the difference is critical. Event notification is lightweight:

public class UserLoginEvent {
    private String userId;
    private Instant timestamp;
    // Minimal data - consumers fetch details if needed
}

Event sourcing captures complete state changes:

public class OrderPlacedEvent {
    private String orderId;
    private List<OrderItem> items;
    private Address shippingAddress;
    private PaymentDetails payment;
    private Instant timestamp;
    // Complete information to reconstruct state
}

For high-scale systems processing millions of events, I’ve found that event notification with metadata works best for most use cases, reserving event sourcing for critical business events where complete audit trails are required.
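
As a sketch of that middle ground, a thin notification can carry just enough metadata for most consumers to act on, with a lookup fallback for the few that need full details (the user_service client here is hypothetical):

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class UserLoginNotification:
    # Thin notification: identifiers plus routing metadata, not full state
    user_id: str
    timestamp: datetime
    region: str  # metadata that lets most consumers act without a lookup

def handle_login(event, user_service=None):
    # Most consumers can route, count, or alert on the metadata alone
    print(f"login from {event.region} at {event.timestamp.isoformat()}")
    # Only detail-hungry consumers pay the cost of fetching the full record
    if user_service is not None:
        return user_service.get_user(event.user_id)

handle_login(UserLoginNotification("user-123", datetime.now(timezone.utc), "eu-west-1"))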

2. Topic Design and Partitioning Strategy

Kafka’s partitioning strategy directly impacts your system’s scalability. Here’s a production-tested approach:

import random
import zlib

class EventPartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, event):
        # Partition by tenant/customer ID for ordering guarantees
        # within a customer while maintaining parallelism across customers.
        # zlib.crc32 is stable across processes; Python's built-in hash()
        # is salted per process and would reshuffle partitions on restart.
        if event.tenant_id:
            return zlib.crc32(event.tenant_id.encode("utf-8")) % self.num_partitions

        # Fall back to random partitioning for non-tenant events
        return random.randint(0, self.num_partitions - 1)

Key considerations:

  • Hot partitions: Monitor per-partition traffic and rebalance keys if needed (a detection sketch follows this list)
  • Ordering guarantees: Messages with the same key go to the same partition
  • Consumer group sizing: Number of consumers ≤ number of partitions
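
A minimal hot-partition check, assuming you can pull per-partition message counts for a recent window from your metrics system:

def find_hot_partitions(partition_counts, skew_threshold=2.0):
    # Flag partitions receiving an outsized share of traffic, e.g. when
    # tenant hashing sends one large customer to a single partition
    mean = sum(partition_counts.values()) / len(partition_counts)
    return {
        partition: count
        for partition, count in partition_counts.items()
        if count > skew_threshold * mean
    }

find_hot_partitions({0: 1200, 1: 900, 2: 15000, 3: 1100})  # -> {2: 15000}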

3. Handling Backpressure

When consumers can’t keep up with producers, you need backpressure mechanisms:

type EventProcessor struct {
    inputChan  chan Event
    bufferSize int
    semaphore  chan struct{}
}

func (p *EventProcessor) ProcessWithBackpressure(ctx context.Context) {
    // Limit concurrent processing
    p.semaphore = make(chan struct{}, p.bufferSize)

    for {
        select {
        case event := <-p.inputChan:
            // Block if too many in-flight
            p.semaphore <- struct{}{}

            go func(e Event) {
                defer func() { <-p.semaphore }()
                p.process(e)
            }(event)

        case <-ctx.Done():
            return
        }
    }
}

Production Lessons Learned

Schema Evolution

Never underestimate schema evolution complexity. Use schema registries and follow these rules:

  1. Always add fields as optional: New fields should have defaults
  2. Never remove required fields: Deprecate and add new versions instead
  3. Use schema versioning: Embed version numbers in events

For example, a versioned event envelope:

{
  "schemaVersion": "v2",
  "eventType": "user.activity.recorded",
  "timestamp": "2021-01-15T10:30:00Z",
  "payload": {
    "userId": "user-123",
    "activityType": "login",
    "metadata": {
      "ipAddress": "192.168.1.1",
      "userAgent": "Mozilla/5.0..."
    }
  }
}
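
On the consumer side, one way to honor rule 3 is to normalize older versions to the current shape at the edge; a sketch with a hypothetical v1-to-v2 upgrader:

CURRENT_VERSION = "v2"

def upgrade_v1_to_v2(event):
    # Hypothetical migration: v2 added an optional metadata block
    event.setdefault("payload", {}).setdefault("metadata", {})
    event["schemaVersion"] = "v2"
    return event

UPGRADERS = {"v1": upgrade_v1_to_v2}

def normalize(event):
    # Walk the upgrade chain until the event reaches the current version;
    # a missing upgrader raises KeyError rather than looping forever
    while event.get("schemaVersion", "v1") != CURRENT_VERSION:
        event = UPGRADERS[event.get("schemaVersion", "v1")](event)
    return event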

Idempotency is Non-Negotiable

At scale, duplicate events are inevitable. Design for idempotency from day one:

public class IdempotentEventHandler {
    private final EventStore eventStore;

    public IdempotentEventHandler(EventStore eventStore) {
        this.eventStore = eventStore;
    }

    public void handle(Event event) {
        // Check if already processed (exists + markProcessed should be
        // atomic under concurrent consumers, e.g. a unique-key insert)
        if (eventStore.exists(event.getId())) {
            return; // Skip duplicate
        }

        // Mark as processed only after success; a failure leaves the
        // event unmarked so it can be retried
        processEvent(event);
        eventStore.markProcessed(event.getId());
    }
}

Dead Letter Queues and Error Handling

Implement comprehensive error handling with DLQs:

class EventConsumer:
    def __init__(self, kafka_consumer, dlq_producer):
        self.consumer = kafka_consumer
        self.dlq = dlq_producer
        self.max_retries = 3

    def process_with_dlq(self, event):
        retry_count = event.metadata.get('retry_count', 0)

        try:
            self.process(event)
        except RetriableError as e:
            if retry_count < self.max_retries:
                event.metadata['retry_count'] = retry_count + 1
                self.consumer.retry(event)
            else:
                self.dlq.send(event, error=str(e))
        except NonRetriableError as e:
            self.dlq.send(event, error=str(e))

Observability for Event-Driven Systems

Monitoring distributed event flows requires specialized approaches:

End-to-End Tracing

Implement correlation IDs to track events through the system:

type Event struct {
    ID            string
    Type          string
    CorrelationID string
    CausationID   string
    Timestamp     time.Time
    Payload       interface{}
}

func (e *Event) CreateChild(eventType string) Event {
    return Event{
        ID:            generateID(),
        Type:          eventType,       // The child's own event type
        CorrelationID: e.CorrelationID, // Same correlation across the flow
        CausationID:   e.ID,            // Parent event ID
        Timestamp:     time.Now(),
    }
}

Key Metrics to Monitor

  1. Producer metrics: Message production rate, failure rate, serialization time
  2. Consumer metrics: Lag (critical!), processing time, error rates
  3. Infrastructure metrics: Partition distribution, replication lag, disk usage

For example, a basic metrics snapshot:

metrics = {
    'consumer_lag': consumer.lag(),  # Messages behind
    'processing_time_p99': percentile(processing_times, 99),
    'error_rate': errors / total_processed,
    'throughput': messages_per_second
}
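
For the most critical of these, consumer lag, a minimal check with the kafka-python client (confluent-kafka and the Java client expose equivalents):

from kafka import KafkaConsumer

def total_consumer_lag(consumer: KafkaConsumer) -> int:
    # Lag per partition = latest broker offset minus our current position
    partitions = list(consumer.assignment())
    end_offsets = consumer.end_offsets(partitions)
    return sum(end_offsets[tp] - consumer.position(tp) for tp in partitions)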

Scaling Considerations

When to Add Partitions

Adding partitions is a one-way operation in Kafka: you can increase the count but never decrease it, and the change remaps keys to new partitions, breaking per-key ordering across the transition. Before adding, consider:

  • Per-partition throughput ceiling: ~10 MB/s is a common rule of thumb (see the sizing sketch below)
  • Consumer processing capability
  • Ordering requirements: ordering is only guaranteed within a partition
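
A back-of-the-envelope partition count, treating the 10 MB/s figure as an assumption to validate against your own benchmarks:

import math

def partitions_needed(peak_mb_per_s, per_partition_mb_per_s=10, headroom=2.0):
    # Provision for peak throughput plus headroom for growth and failover
    return math.ceil(peak_mb_per_s * headroom / per_partition_mb_per_s)

partitions_needed(120)  # 120 MB/s peak with 2x headroom -> 24 partitions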

Consumer Group Management

Properties props = new Properties();
props.put("group.id", "event-processor-group");
props.put("max.poll.records", 500);  // Tune based on processing time
props.put("session.timeout.ms", 30000);
props.put("heartbeat.interval.ms", 3000);

// Auto-commit is disabled so offsets are committed manually after
// successful processing; enable it only if processing is idempotent
props.put("enable.auto.commit", "false");

Anti-Patterns to Avoid

  1. Event chains that are too long: More than 3-4 hops makes debugging nightmarish
  2. Putting too much data in events: Keep events focused and lean
  3. Synchronous event processing: Defeats the purpose of async architecture
  4. No retry strategy: Networks fail, plan for it
  5. Ignoring consumer lag: Lag is your canary in the coal mine

Conclusion

Event-driven architectures enable incredible scale, but they require careful design and operational discipline. Focus on idempotency, observability, and proper error handling from the start. The patterns shared here have proven effective in systems processing hundreds of millions of events daily, but remember: every system has unique requirements. Use these as starting points and adapt based on your specific needs.

The key to success is starting simple, measuring everything, and scaling incrementally as you learn your system’s behavior under load.