Processing 100 million events per day means sustaining over 1,000 events per second on average, with peaks potentially 10x higher. At this scale, architectural decisions that seem minor at lower volumes become critical. After building and operating systems at this scale, I’ve learned that the architecture must be designed for failure, optimized for the common case, and instrumented for understanding behavior under load.
Scale Characteristics and Constraints
Understanding the Numbers
100 million events per day translates to:
- Average: 1,157 events/second
- Peak: roughly 6,000-12,000 events/second (assuming a 5-10x peak factor)
- Daily data volume: 100GB-1TB, assuming 1-10KB per event
- Monthly data: 3-30TB
- Annual retention: 36-360TB
These numbers drive architectural decisions:
- Database writes: thousands per second
- Network throughput: sustained gigabits per second
- Storage: terabytes to petabytes
- Processing: hundreds to thousands of CPU cores, depending on per-event work
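These figures are straightforward to derive; a back-of-envelope script (event sizes of 1-10KB and the 5-10x peak factor are assumptions, not measurements) makes the arithmetic explicit:

```python
# Back-of-envelope capacity math for 100M events/day.
# Event sizes (1-10 KB) and the 5-10x peak factor are assumptions, not measurements.
EVENTS_PER_DAY = 100_000_000
SECONDS_PER_DAY = 86_400

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY            # ~1,157 events/second
peak_eps = (avg_eps * 5, avg_eps * 10)                # ~5,800-11,600 events/second

for event_kb in (1, 10):                              # assumed average event size
    daily_gb = EVENTS_PER_DAY * event_kb / 1_000_000  # KB -> GB (decimal units)
    print(f"{event_kb} KB events: "
          f"{daily_gb:,.0f} GB/day, "
          f"{daily_gb * 30 / 1_000:,.1f} TB/month, "
          f"{daily_gb * 365 / 1_000:,.0f} TB/year")

print(f"average: {avg_eps:,.0f} events/s, "
      f"peak: {peak_eps[0]:,.0f}-{peak_eps[1]:,.0f} events/s")
```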
System Constraints at Scale
Write Throughput Dominance:
- Reads are typically less than 10% of writes
- Cannot rely on caching for writes
- Write path is the critical path
- Optimization focuses on ingestion
Partitioning Required:
- Single-node systems cannot handle this load
- Horizontal partitioning essential
- Careful partition key selection critical
- Rebalancing strategies needed
Eventual Consistency Acceptable:
- Strong consistency too expensive at scale
- Eventual consistency with bounded lag
- Design for idempotency
- Compensating transactions for corrections
Cost Constraints:
- Infrastructure costs become significant
- Efficiency directly impacts profitability
- Resource optimization crucial
- Monitoring costs themselves become non-trivial
Ingestion Architecture Patterns
Multi-Stage Ingestion Pipeline
High-throughput ingestion requires staged processing:
Stage 1: Edge Collection:
- Accept events from producers
- Minimal validation (malformed requests rejected)
- Immediate acknowledgment
- Load balancing across collectors
- Geographic distribution for latency
Stage 2: Buffering and Batching:
- Queue or stream buffer (Kafka, Kinesis)
- Absorbs traffic spikes
- Enables batching for efficiency
- Provides durability guarantee
- Decouples collection from processing
Stage 3: Validation and Enrichment:
- Schema validation
- Business rule checking
- Data enrichment (lookups, transformations)
- Dead letter queue for invalid events
- Parallel processing for throughput
Stage 4: Storage and Indexing:
- Batch writes to storage
- Update indexes
- Update aggregates
- Replicate to read stores
- Archive to cold storage
Architectural Benefits:
- Each stage optimized independently
- Failure in one stage doesn’t cascade
- Backpressure managed through queues
- Easy to add processing stages
Trade-offs:
- Higher complexity
- End-to-end latency increased
- More infrastructure components
- Operational overhead
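As a rough illustration of the stage structure above, here is a minimal in-process sketch: bounded queues stand in for the buffering layer, and the validation and enrichment logic is a placeholder. In a real deployment the buffer would be Kafka or Kinesis and each stage would run and scale independently as its own service.

```python
# Minimal sketch of a staged ingestion pipeline using a bounded in-process queue.
# In production the buffer would be Kafka/Kinesis and each stage its own service;
# names, sizes, and the enrichment lookup here are illustrative assumptions.
import queue
import threading

buffer_q = queue.Queue(maxsize=10_000)      # stage 2: bounded buffer absorbs spikes
storage_batch = []                          # stage 4: accumulates validated events

def collect(event: dict) -> bool:
    """Stage 1: accept, minimally validate, acknowledge immediately."""
    if "id" not in event or "ts" not in event:
        return False                        # reject malformed events at the edge
    buffer_q.put(event)                     # blocks if the buffer is full (backpressure)
    return True                             # ack to the producer

def lookup_region(event: dict) -> str:
    return "eu" if str(event["id"]).startswith("eu-") else "us"  # stand-in lookup

def validate_and_enrich():
    """Stage 3: schema/business checks plus enrichment, then hand off to storage."""
    while True:
        event = buffer_q.get()
        if event is None:                   # shutdown sentinel for this demo
            break
        event["region"] = lookup_region(event)   # enrichment via reference data
        storage_batch.append(event)              # stage 4 would batch-write this

worker = threading.Thread(target=validate_and_enrich, daemon=True)
worker.start()
collect({"id": "eu-123", "ts": 1700000000})
buffer_q.put(None)
worker.join()
print(storage_batch)
```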
Load Balancing Strategies
Distributing load across ingestion nodes:
Geographic Load Balancing:
- Route to nearest data center
- Reduces latency for producers
- Regional failure isolation
- Compliance with data residency
Consistent Hashing:
- Deterministic routing
- Minimal disruption on node changes
- Enables local caching
- Uneven distribution with skewed keys
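Consistent hashing is the least obvious of these to implement; a minimal ring with virtual nodes might look like the following (the hash function and virtual-node count are illustrative choices):

```python
# Minimal consistent-hash ring with virtual nodes; the hash function and
# virtual-node count are illustrative choices, not a recommendation.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["collector-1", "collector-2", "collector-3"])
print(ring.node_for("event-42"))   # deterministic: the same key always routes the same way
```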
Least Connections:
- Route to least busy node
- Better balance with variable request costs
- Requires connection state tracking
- More complex than round-robin
Weighted Round-Robin:
- Account for heterogeneous nodes
- Simple and predictable
- Static weights may not adapt to load
- Good for homogeneous infrastructure
The choice depends on client distribution, event size variation, and infrastructure homogeneity.
Storage Architecture for High Write Throughput
Write-Optimized Storage Patterns
Log-Structured Merge (LSM) Trees:
- Sequential writes to log
- Background compaction
- Excellent write throughput
- Read amplification trade-off
- Used by Cassandra, RocksDB, HBase
Append-Only Logs:
- Pure sequential writes
- No updates or deletes
- Compaction for cleanup
- Excellent write performance
- Limited query capabilities
Time-Series Databases:
- Optimized for time-ordered writes
- Downsampling for retention
- Efficient range queries
- Limited secondary indexes
- Examples: InfluxDB, TimescaleDB, ClickHouse
Partitioned Relational Databases:
- Time-based or key-based partitioning
- Partition pruning for queries
- Parallel writes to partitions
- More complex management
- Examples: PostgreSQL with partitioning, MySQL sharding
Partitioning Strategies
Time-Based Partitioning:
- One partition per time period (hour, day)
- Efficient for time-range queries
- Easy to archive old partitions
- May have hot partition for current time
- Predictable growth pattern
Hash-Based Partitioning:
- Hash entity ID to partition
- Even distribution
- Enables horizontal scaling
- Difficult to rebalance
- Range queries require all partitions
Range-Based Partitioning:
- Partition by entity ID ranges
- Enables range queries
- Uneven distribution with skewed data
- Easier rebalancing than hash
- Requires careful boundary selection
Composite Partitioning:
- Combine strategies (time + hash)
- Balance benefits of multiple approaches
- Time-based primary, hash-based secondary
- More complex but powerful
- Common at large scale
For event systems, composite time + entity hash partitioning often works best.
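A composite key can be as simple as a time bucket concatenated with a hash bucket of the entity ID; the hour granularity and 64 hash buckets below are assumptions chosen to show the shape:

```python
# Composite partition key: time bucket (primary) + entity hash bucket (secondary).
# Hour granularity and 64 hash buckets are illustrative assumptions.
import hashlib
from datetime import datetime, timezone

HASH_BUCKETS = 64

def partition_key(entity_id: str, event_ts: float) -> str:
    hour = datetime.fromtimestamp(event_ts, tz=timezone.utc).strftime("%Y%m%d%H")
    bucket = int(hashlib.sha1(entity_id.encode()).hexdigest(), 16) % HASH_BUCKETS
    return f"{hour}-{bucket:02d}"

print(partition_key("user-123", 1700000000.0))   # something like "2023111422-<bucket>"
```

Time-range queries prune by the hour prefix, while the hash suffix spreads concurrent writes for the same hour across partitions.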
Hot Partition Problem
Skewed write distribution to partitions:
Causes:
- Popular entities (celebrities, trending topics)
- Time-based patterns (start of hour)
- Key hashing collisions
- Poor partition key choice
Mitigation Strategies:
Partition Splitting:
- Split hot partitions into multiple
- Redistribute keys
- Requires data migration
- Temporary disruption
Sub-Partitioning:
- Add random suffix to hot keys
- Spread across multiple partitions
- Requires gathering on read
- Effective for write-heavy, read-light
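A sketch of key salting, assuming the set of hot keys is known and a fixed fan-out; the read path must gather across all suffixes:

```python
# Key salting for hot partitions: spread a known-hot key across N sub-partitions
# on write, and gather across all suffixes on read. The fan-out of 8 is assumed.
import random

HOT_KEYS = {"celebrity-1"}       # identified hot entities (assumed known)
FANOUT = 8

def write_key(entity_id: str) -> str:
    if entity_id in HOT_KEYS:
        return f"{entity_id}#{random.randrange(FANOUT)}"    # salt spreads writes
    return entity_id

def read_keys(entity_id: str) -> list[str]:
    if entity_id in HOT_KEYS:
        return [f"{entity_id}#{i}" for i in range(FANOUT)]  # gather on read
    return [entity_id]

print(write_key("celebrity-1"))   # e.g. "celebrity-1#3"
print(read_keys("celebrity-1"))   # all 8 sub-partitions must be queried
```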
Quota Management:
- Rate limit high-volume entities
- Prevent single entity overwhelming system
- Business decision on fairness
- May require separate tier for high-volume
Separate Paths:
- Dedicated infrastructure for high-volume entities
- Isolates impact
- Higher complexity
- Economical for few high-volume entities
Processing Architecture Patterns
Parallel Processing with Exactly-Once Semantics
At scale, failures and retries are continuous, so events will inevitably be delivered more than once; processing must be built to tolerate duplicates.
Idempotency Patterns:
Unique Event IDs:
- Producer assigns unique ID
- Deduplicate based on ID
- Store processed IDs
- Bloom filter for quick check
Conditional Writes:
- Database supports conditional updates
- Check before modifying
- Atomic test-and-set
- Prevents duplicate effects
Versioning:
- Every state change has version number
- Only apply if version matches
- Prevents out-of-order updates
- Enables conflict detection
State Store for Deduplication:
- Track processed event IDs
- TTL for old entries
- Distributed cache (Redis, Memcached)
- Balance dedup-store size against the cost of reprocessing duplicates
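A minimal deduplication sketch based on producer-assigned IDs and a TTL window; the in-memory store and one-hour TTL are assumptions, and a production system would typically keep this state in Redis (with key expiry) fronted by a Bloom filter:

```python
# Deduplicate by producer-assigned event ID, with a TTL so the store stays bounded.
# In-memory dict for illustration; production would use Redis or similar.
import time

class Deduplicator:
    def __init__(self, ttl_seconds: float = 3600.0):   # TTL is an assumed window
        self._ttl = ttl_seconds
        self._seen: dict[str, float] = {}              # event_id -> expiry time

    def should_process(self, event_id: str) -> bool:
        now = time.time()
        # Lazily drop expired entries to bound memory (O(n) here; a real store
        # would expire entries itself).
        self._seen = {eid: exp for eid, exp in self._seen.items() if exp > now}
        if event_id in self._seen:
            return False                               # duplicate: skip side effects
        self._seen[event_id] = now + self._ttl
        return True

dedup = Deduplicator()
print(dedup.should_process("evt-1"))   # True: first delivery
print(dedup.should_process("evt-1"))   # False: redelivered duplicate
```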
Backpressure and Flow Control
Preventing system overload:
Queue-Based Backpressure:
- Fixed queue size limits
- Producers block when queue full
- Provides natural pushback
- May cause producer timeouts
Rate Limiting:
- Limit ingestion rate
- Smooth traffic spikes
- May reject events during peaks
- Requires retry logic
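A token bucket is a common way to implement the rate limit; the sustained rate and burst size below are illustrative:

```python
# Token-bucket rate limiter: allows a sustained rate with a bounded burst.
# Rate and burst values are illustrative assumptions.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False            # caller retries later or sheds the event

limiter = TokenBucket(rate_per_sec=2000, burst=500)
accepted = sum(limiter.allow() for _ in range(1000))
print(f"accepted {accepted} of 1000 in a burst")   # roughly the burst size
```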
Dynamic Scaling:
- Add processing capacity under load
- Auto-scaling based on queue depth
- Takes time to spin up (minutes)
- Cost-effective but not instant
Graceful Degradation:
- Sample events during overload
- Process subset, drop rest
- Maintain system stability
- Accept partial data loss
The architectural choice depends on whether you can afford to drop events or must process all.
Batch Processing Optimization
Processing efficiency improves with batching:
Micro-Batching:
- Accumulate events for short period (100ms-1s)
- Process batch together
- Amortize per-event overhead
- Balance latency and throughput
Batch Size Tuning:
- Too small: high overhead
- Too large: high latency, memory pressure
- Depends on event size and processing
- Typically 100-1000 events per batch
Batch Writing:
- Single write for batch vs per-event writes
- 10-100x performance improvement
- Requires transaction support or accepts partial failure
- Critical for database writes
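A micro-batcher that flushes on either a size or an age threshold captures most of this; the thresholds and the flush callback are assumptions, and a production version would also flush from a background timer rather than only on event arrival:

```python
# Micro-batcher: flush when the batch reaches max_size or max_age_seconds,
# whichever comes first. Thresholds and the flush callback are illustrative.
import time

class MicroBatcher:
    def __init__(self, flush_fn, max_size=500, max_age_seconds=0.5):
        self._flush_fn = flush_fn
        self._max_size = max_size
        self._max_age = max_age_seconds
        self._batch = []
        self._first_ts = None

    def add(self, event):
        if not self._batch:
            self._first_ts = time.monotonic()
        self._batch.append(event)
        if (len(self._batch) >= self._max_size
                or time.monotonic() - self._first_ts >= self._max_age):
            self.flush()

    def flush(self):
        if self._batch:
            self._flush_fn(self._batch)     # one bulk write instead of N writes
            self._batch = []

batcher = MicroBatcher(lambda batch: print(f"wrote {len(batch)} events"), max_size=3)
for i in range(7):
    batcher.add({"id": i})
batcher.flush()                             # flush the tail on shutdown
```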
Aggregation and Real-Time Analytics
Pre-Aggregation Patterns
Computing aggregates incrementally:
Streaming Aggregation:
- Update aggregates as events arrive
- Maintain state for active windows
- Emit results periodically
- Windowing strategies (tumbling, sliding, session)
Lambda Architecture for Aggregates:
- Real-time: approximate aggregates from stream
- Batch: accurate aggregates from historical data
- Merge at query time
- Balance freshness and accuracy
Materialized Views:
- Incrementally maintain aggregate tables
- Serve from pre-computed results
- Background recomputation for accuracy
- Trade storage for query speed
Approximate Aggregations:
- Sketches and probabilistic structures
- Count-min sketch for frequency
- HyperLogLog for cardinality
- Significant space savings, bounded error
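As one concrete example, a count-min sketch estimates per-key frequencies in fixed memory with a bounded, one-sided error; the width and depth below are assumed values:

```python
# Count-min sketch: fixed-memory frequency estimates with one-sided (over)estimation.
# Width and depth control the error bound and are assumed values here.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key: str, count: int = 1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(1000):
    cms.add("page:/home")
cms.add("page:/rare")
print(cms.estimate("page:/home"), cms.estimate("page:/rare"))   # ~1000, >=1
```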
At 100M events/day, pre-aggregation is essential; aggregates cannot be computed on demand from raw events at query time.
Time Window Management
Tumbling Windows:
- Fixed, non-overlapping time periods
- Each event in exactly one window
- Simple to implement
- Arbitrary boundaries (events at 11:59:59.999 and 12:00:00.001 land in different windows)
Sliding Windows:
- Overlapping time periods
- Each event in multiple windows
- Smoother transitions
- Higher computation cost
Session Windows:
- Gap-based grouping
- Window ends after inactivity period
- Variable window sizes
- Useful for user sessions
Watermarking:
- Handle late-arriving events
- Determine when window is complete
- Trade-off between latency and completeness
- Critical for correctness
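A tumbling-window counter with a watermark shows how window assignment, late events, and window closing interact; the window size and allowed lateness are assumptions:

```python
# Tumbling-window counts with a watermark: a window is emitted only once the
# watermark (max event time seen minus allowed lateness) passes its end.
# Window size and lateness bound are assumed values.
from collections import defaultdict

WINDOW_SECONDS = 60
ALLOWED_LATENESS = 30

windows = defaultdict(int)        # window_start -> count
emitted = {}
max_event_time = 0.0

def on_event(event_time: float):
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    window_start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    if window_start + WINDOW_SECONDS <= watermark:
        return                    # too late: window already closed (route to DLQ/corrections)
    windows[window_start] += 1
    # Close and emit every window whose end is now behind the watermark.
    for start in sorted(windows):
        if start + WINDOW_SECONDS <= watermark:
            emitted[start] = windows.pop(start)

for t in (0, 10, 61, 65, 5, 130):     # the event at t=5 arrives late but within bounds
    on_event(t)
print("emitted:", emitted, "open:", dict(windows))
```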
Fault Tolerance and Reliability
Failure Modes at Scale
With thousands of events per second, failures are constant:
Expected Failure Rates:
- Infrastructure: 0.1% of requests fail
- Network: transient errors common
- Dependencies: occasional unavailability
- At 1000/sec, that’s 1 failure/second
Design for Failure:
- Retries with exponential backoff
- Circuit breakers for failing dependencies
- Fallbacks and degraded modes
- Isolation to prevent cascade failures
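Retries with exponential backoff and full jitter are the workhorse here; the attempt count and delay bounds below are assumptions, and only idempotent operations should be retried this way:

```python
# Retry with exponential backoff and full jitter; caps and attempt count are
# illustrative. Only retry operations that are safe to repeat (idempotent).
import random
import time

def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                              # give up: surface to DLQ/alerting
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))   # full jitter avoids thundering herds

# Usage with a flaky stand-in dependency:
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry(flaky_write))   # succeeds on the third attempt
```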
Checkpointing and Recovery
Periodic Checkpointing:
- Save processing state regularly
- Balance checkpoint frequency vs recovery time
- Asynchronous checkpointing to avoid blocking
- Checkpoint storage durability critical
Event Log Replay:
- Reprocess from event log after failure
- Requires idempotent processing
- Can take hours for large backlogs
- Combine with checkpoints for faster recovery
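A minimal checkpoint-and-replay loop over an event log; the atomic write-then-rename is the important detail, while the file layout and checkpoint interval are assumptions:

```python
# Periodic checkpointing of (offset, state) with an atomic replace, plus replay of
# the event log from the last checkpoint on restart. The file layout is illustrative.
import json
import os

CHECKPOINT_PATH = "checkpoint.json"

def load_checkpoint():
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"offset": 0, "state": {"count": 0}}

def save_checkpoint(offset, state):
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset, "state": state}, f)
    os.replace(tmp, CHECKPOINT_PATH)          # atomic: never a half-written checkpoint

def process(log, checkpoint_every=1000):
    cp = load_checkpoint()
    state = cp["state"]
    for offset in range(cp["offset"], len(log)):
        state["count"] += 1                   # the real work; must be idempotent, since
                                              # replay repeats events after the checkpoint
        if (offset + 1) % checkpoint_every == 0:
            save_checkpoint(offset + 1, state)
    save_checkpoint(len(log), state)
    return state

print(process([{"id": i} for i in range(2500)]))   # resumes from the checkpoint if rerun
```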
Multi-Version Concurrency:
- Keep multiple versions of state
- Roll back to previous version on corruption
- Higher storage cost
- Enables fast recovery
Monitoring and Alerting at Scale
Key Metrics:
Throughput Metrics:
- Events per second (p50, p95, p99)
- Bytes per second
- Batch size distribution
- Trends over time
Latency Metrics:
- End-to-end latency
- Per-stage latency
- Percentiles (p95, p99, p99.9)
- Latency budgets
Error Metrics:
- Error rate by type
- Retry rate
- Dead letter queue size
- Downstream error propagation
Resource Metrics:
- CPU utilization
- Memory usage
- Network throughput
- Disk I/O
- Queue depths
Alert Thresholds:
- Anomaly-based rather than static
- Account for daily/weekly patterns
- Multiple severity levels
- Alert fatigue prevention
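A simple form of anomaly-based alerting compares the current value against a rolling mean and standard deviation instead of a fixed line; the window length and sigma multiplier are assumptions:

```python
# Anomaly-based alerting: flag values that deviate from a rolling baseline
# instead of crossing a static threshold. Window and sigma are assumed values.
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    def __init__(self, window=60, sigmas=4.0):
        self.history = deque(maxlen=window)
        self.sigmas = sigmas

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 10:                    # need a baseline first
            mu, sd = mean(self.history), stdev(self.history)
            anomalous = sd > 0 and abs(value - mu) > self.sigmas * sd
        self.history.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for eps in [1100, 1150, 1120, 1180, 1140, 1160, 1130, 1170, 1110, 1150, 9000]:
    if detector.is_anomalous(eps):
        print(f"alert: events/sec {eps} deviates from the rolling baseline")
```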
Performance Optimization Patterns
Hot Path Optimization
The ingestion path runs for every event, so small per-event costs multiply across millions of events:
Minimize Allocations:
- Object pooling for frequent allocations
- Reduce garbage collection pressure
- Pre-allocate buffers
- Zero-copy techniques where possible
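A buffer pool is the simplest version of this; it is sketched in Python for consistency with the other examples, though the technique matters most in GC-heavy or native hot paths, and the sizes are assumptions:

```python
# Simple buffer pool: reuse pre-allocated bytearrays instead of allocating one
# per event, reducing allocation and GC pressure on the hot path. Sizes are assumed.
from collections import deque

class BufferPool:
    def __init__(self, buffer_size=64 * 1024, max_pooled=256):
        self._size = buffer_size
        self._pool = deque(maxlen=max_pooled)

    def acquire(self) -> bytearray:
        return self._pool.pop() if self._pool else bytearray(self._size)

    def release(self, buf: bytearray):
        self._pool.append(buf)      # oldest buffers fall off when the pool is full

pool = BufferPool()
buf = pool.acquire()                # reused across events instead of reallocated
buf[:5] = b"hello"
pool.release(buf)
```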
Avoid Synchronous I/O:
- Async I/O for network and disk
- Non-blocking operations
- Callbacks or futures
- Event-driven architecture
Batch Database Operations:
- Bulk inserts instead of individual
- Prepared statements
- Connection pooling
- Reduce round-trips
Caching Frequently Accessed Data:
- Reference data in memory
- Configuration cached
- Lookups memoized
- Cache invalidation strategy
Network Optimization
The network is often the bottleneck at scale:
Protocol Selection:
- Binary protocols (Protobuf, Avro) over JSON
- Compression for large payloads
- HTTP/2 for multiplexing
- gRPC for service-to-service
Connection Management:
- Connection pooling and reuse
- Keep-alive connections
- Limit connection overhead
- Balance pool size vs resource usage
Payload Optimization:
- Remove unnecessary fields
- Compact representations
- Compression (gzip, zstd)
- Delta encoding for updates
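The payload savings are easy to measure; zlib is used below because it ships with the standard library (zstd typically offers a better speed/ratio trade-off), and the synthetic payload is an assumption:

```python
# Measure the effect of compressing a batch of JSON events before sending.
# zlib is used because it ships with Python; the payload is synthetic.
import json
import zlib

events = [{"id": i, "type": "page_view", "url": "/products/123", "ts": 1700000000 + i}
          for i in range(1000)]
raw = json.dumps(events).encode()
compressed = zlib.compress(raw, level=6)

print(f"raw: {len(raw):,} bytes, compressed: {len(compressed):,} bytes, "
      f"ratio: {len(raw) / len(compressed):.1f}x")
```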
Database Optimization
Database writes are often the bottleneck:
Write Optimization:
- Batch writes (100-1000 records)
- Use bulk insert APIs
- Disable/defer index updates during bulk insert
- Partition for parallel writes
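The batching effect is visible even with an embedded database; sqlite3 appears below purely for illustration, and a production system would use its database's bulk-load path (for example COPY in PostgreSQL):

```python
# Batch insert via one transaction and one prepared statement, using sqlite3
# purely for illustration; production systems would use their database's bulk API.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, type TEXT, ts INTEGER)")

rows = [(i, "page_view", 1700000000 + i) for i in range(10_000)]

with conn:                                   # single transaction for the whole batch
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])   # 10000
```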
Index Strategy:
- Minimal indexes on high-write tables
- Covering indexes for frequent queries
- Defer index creation to batch process
- Partial indexes for subset queries
Connection Pooling:
- Reuse database connections
- Size pool for concurrency
- Monitor connection usage
- Avoid connection exhaustion
Cost Optimization at Scale
At 100M events/day, costs matter:
Infrastructure Right-Sizing:
- Profile actual resource usage
- Avoid over-provisioning
- Use auto-scaling for variable load
- Reserved capacity for baseline
Tiered Storage:
- Hot data (recent) on fast storage
- Warm data on standard storage
- Cold data (archive) on cheap storage
- Automatic lifecycle management
Compression:
- Compress stored events
- 5-10x size reduction typical
- CPU cost vs storage savings
- Especially valuable for long retention
Sampling and Aggregation:
- Store raw events for limited period
- Store aggregates for longer
- Reduces storage costs 10-100x
- Accept loss of granularity for old data
Conclusion
Building systems that process 100M+ events per day requires architectural patterns specifically designed for scale:
- Multi-stage ingestion with buffering to handle spikes and decouple stages
- Write-optimized storage with appropriate partitioning for throughput
- Idempotent processing to handle inevitable failures and duplicates
- Pre-aggregation for real-time analytics, since aggregates cannot be computed on demand
- Comprehensive monitoring to understand behavior at scale
- Cost optimization as infrastructure costs become significant
The key insight: at this scale, you cannot reason about individual events. The architecture must handle events in aggregate, optimize for overall throughput rather than per-event latency, and assume continuous partial failures.
Start with these patterns from the beginning. Retrofitting them into a system designed for lower scale is far more difficult than building them in from the start. The architecture that works at 1M events/day will not work at 100M. Plan for scale even if you’re not there yet.