Processing 100 million events per day means sustaining over 1,000 events per second on average, with peaks potentially 10x higher. At this scale, architectural decisions that seem minor at lower volumes become critical. After building and operating systems at this scale, I’ve learned that the architecture must be designed for failure, optimized for the common case, and instrumented for understanding behavior under load.
Scale Characteristics and Constraints
Understanding the Numbers
100 million events per day translates to:
- Average: 1,157 events/second
- Peak: roughly 6,000-12,000 events/second (assuming a 5-10x peak factor)
- Daily data volume: 100GB-1TB, assuming 1-10KB per event
- Monthly data: 3-30TB
- Annual retention: 36-360TB
These numbers drive architectural decisions:
- Database writes: thousands per second
- Network throughput: sustained gigabits per second
- Storage: terabytes to petabytes
- Processing: hundreds to thousands of CPU cores, depending on per-event work
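These figures are straightforward to derive; a back-of-envelope script (event sizes of 1-10KB and the 5-10x peak factor are assumptions, not measurements) makes the arithmetic explicit:

```python
# Back-of-envelope capacity math for 100M events/day.
# Event sizes (1-10 KB) and the 5-10x peak factor are assumptions, not measurements.
EVENTS_PER_DAY = 100_000_000
SECONDS_PER_DAY = 86_400

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY            # ~1,157 events/second
peak_eps = (avg_eps * 5, avg_eps * 10)                # ~5,800-11,600 events/second

for event_kb in (1, 10):                              # assumed average event size
    daily_gb = EVENTS_PER_DAY * event_kb / 1_000_000  # KB -> GB (decimal units)
    print(f"{event_kb} KB events: "
          f"{daily_gb:,.0f} GB/day, "
          f"{daily_gb * 30 / 1_000:,.1f} TB/month, "
          f"{daily_gb * 365 / 1_000:,.0f} TB/year")

print(f"average: {avg_eps:,.0f} events/s, "
      f"peak: {peak_eps[0]:,.0f}-{peak_eps[1]:,.0f} events/s")
```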
System Constraints at Scale
Write Throughput Dominance:
- Reads are typically less than 10% of writes
- Cannot rely on caching for writes
- Write path is the critical path
- Optimization focuses on ingestion
Partitioning Required:
- Single-node systems cannot handle this load
- Horizontal partitioning essential
- Careful partition key selection critical
- Rebalancing strategies needed
Eventual Consistency Acceptable:
- Strong consistency too expensive at scale
- Eventual consistency with bounded lag
- Design for idempotency
- Compensating transactions for corrections
Cost Constraints:
- Infrastructure costs become significant
- Efficiency directly impacts profitability
- Resource optimization crucial
- Monitoring costs themselves become non-trivial
Ingestion Architecture Patterns
Multi-Stage Ingestion Pipeline
High-throughput ingestion requires staged processing:
Stage 1: Edge Collection:
- Accept events from producers
- Minimal validation (malformed requests rejected)
- Immediate acknowledgment
- Load balancing across collectors
- Geographic distribution for latency
Stage 2: Buffering and Batching:
- Queue or stream buffer (Kafka, Kinesis)
- Absorbs traffic spikes
- Enables batching for efficiency
- Provides durability guarantee
- Decouples collection from processing
Stage 3: Validation and Enrichment:
- Schema validation
- Business rule checking
- Data enrichment (lookups, transformations)
- Dead letter queue for invalid events
- Parallel processing for throughput
Stage 4: Storage and Indexing:
- Batch writes to storage
- Update indexes
- Update aggregates
- Replicate to read stores
- Archive to cold storage
Architectural Benefits:
- Each stage optimized independently
- Failure in one stage doesn’t cascade
- Backpressure managed through queues
- Easy to add processing stages
Trade-offs:
- Higher complexity
- End-to-end latency increased
- More infrastructure components
- Operational overhead
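As a rough illustration of the stage structure above, here is a minimal in-process sketch: bounded queues stand in for the buffering layer, and the validation and enrichment logic is a placeholder. In a real deployment the buffer would be Kafka or Kinesis and each stage would run and scale independently as its own service.

```python
# Minimal sketch of a staged ingestion pipeline using a bounded in-process queue.
# In production the buffer would be Kafka/Kinesis and each stage its own service;
# names, sizes, and the enrichment lookup here are illustrative assumptions.
import queue
import threading

buffer_q = queue.Queue(maxsize=10_000)      # stage 2: bounded buffer absorbs spikes
storage_batch = []                          # stage 4: accumulates validated events

def collect(event: dict) -> bool:
    """Stage 1: accept, minimally validate, acknowledge immediately."""
    if "id" not in event or "ts" not in event:
        return False                        # reject malformed events at the edge
    buffer_q.put(event)                     # blocks if the buffer is full (backpressure)
    return True                             # ack to the producer

def lookup_region(event: dict) -> str:
    return "eu" if str(event["id"]).startswith("eu-") else "us"  # stand-in lookup

def validate_and_enrich():
    """Stage 3: schema/business checks plus enrichment, then hand off to storage."""
    while True:
        event = buffer_q.get()
        if event is None:                   # shutdown sentinel for this demo
            break
        event["region"] = lookup_region(event)   # enrichment via reference data
        storage_batch.append(event)              # stage 4 would batch-write this

worker = threading.Thread(target=validate_and_enrich, daemon=True)
worker.start()
collect({"id": "eu-123", "ts": 1700000000})
buffer_q.put(None)
worker.join()
print(storage_batch)
```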
Load Balancing Strategies
Distributing load across ingestion nodes:
Geographic Load Balancing:
- Route to nearest data center
- Reduces latency for producers
- Regional failure isolation
- Compliance with data residency
Consistent Hashing:
- Deterministic routing
- Minimal disruption on node changes
- Enables local caching
- Uneven distribution with skewed keys
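Consistent hashing is the least obvious of these to implement; a minimal ring with virtual nodes might look like the following (the hash function and virtual-node count are illustrative choices):

```python
# Minimal consistent-hash ring with virtual nodes; the hash function and
# virtual-node count are illustrative choices, not a recommendation.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["collector-1", "collector-2", "collector-3"])
print(ring.node_for("event-42"))   # deterministic: the same key always routes the same way
```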
Least Connections:
- Route to least busy node
- Better balance with variable request costs
- Requires connection state tracking
- More complex than round-robin
Weighted Round-Robin:
- Account for heterogeneous nodes
- Simple and predictable
- Static weights may not adapt to load
- Good for homogeneous infrastructure
The choice depends on client distribution, event size variation, and infrastructure homogeneity.
Storage Architecture for High Write Throughput
Write-Optimized Storage Patterns
Log-Structured Merge (LSM) Trees:
- Sequential writes to log
- Background compaction
- Excellent write throughput
- Read amplification trade-off
- Used by Cassandra, RocksDB, HBase
Append-Only Logs:
- Pure sequential writes
- No updates or deletes
- Compaction for cleanup
- Excellent write performance
- Limited query capabilities
Time-Series Databases:
- Optimized for time-ordered writes
- Downsampling for retention
- Efficient range queries
- Limited secondary indexes
- Examples: InfluxDB, TimescaleDB, ClickHouse
Partitioned Relational Databases:
- Time-based or key-based partitioning
- Partition pruning for queries
- Parallel writes to partitions
- More complex management
- Examples: PostgreSQL with partitioning, MySQL sharding
Partitioning Strategies
Time-Based Partitioning:
- One partition per time period (hour, day)
- Efficient for time-range queries
- Easy to archive old partitions
- May have hot partition for current time
- Predictable growth pattern
Hash-Based Partitioning:
- Hash entity ID to partition
- Even distribution
- Enables horizontal scaling
- Difficult to rebalance
- Range queries require all partitions
Range-Based Partitioning:
- Partition by entity ID ranges
- Enables range queries
- Uneven distribution with skewed data
- Easier rebalancing than hash
- Requires careful boundary selection
Composite Partitioning:
- Combine strategies (time + hash)
- Balance benefits of multiple approaches
- Time-based primary, hash-based secondary
- More complex but powerful
- Common at large scale
For event systems, composite time + entity hash partitioning often works best.
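A composite key can be as simple as a time bucket concatenated with a hash bucket of the entity ID; the hour granularity and 64 hash buckets below are assumptions chosen to show the shape:

```python
# Composite partition key: time bucket (primary) + entity hash bucket (secondary).
# Hour granularity and 64 hash buckets are illustrative assumptions.
import hashlib
from datetime import datetime, timezone

HASH_BUCKETS = 64

def partition_key(entity_id: str, event_ts: float) -> str:
    hour = datetime.fromtimestamp(event_ts, tz=timezone.utc).strftime("%Y%m%d%H")
    bucket = int(hashlib.sha1(entity_id.encode()).hexdigest(), 16) % HASH_BUCKETS
    return f"{hour}-{bucket:02d}"

print(partition_key("user-123", 1700000000.0))   # something like "2023111422-<bucket>"
```

Time-range queries prune by the hour prefix, while the hash suffix spreads concurrent writes for the same hour across partitions.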
Hot Partition Problem
Skewed write distribution to partitions:
Causes:
- Popular entities (celebrities, trending topics)
- Time-based patterns (start of hour)
- Key hashing collisions
- Poor partition key choice
Mitigation Strategies:
Partition Splitting:
- Split hot partitions into multiple
- Redistribute keys
- Requires data migration
- Temporary disruption
Sub-Partitioning:
- Add random suffix to hot keys
- Spread across multiple partitions
- Requires gathering on read
- Effective for write-heavy, read-light
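A sketch of key salting, assuming the set of hot keys is known and a fixed fan-out; the read path must gather across all suffixes:

```python
# Key salting for hot partitions: spread a known-hot key across N sub-partitions
# on write, and gather across all suffixes on read. The fan-out of 8 is assumed.
import random

HOT_KEYS = {"celebrity-1"}       # identified hot entities (assumed known)
FANOUT = 8

def write_key(entity_id: str) -> str:
    if entity_id in HOT_KEYS:
        return f"{entity_id}#{random.randrange(FANOUT)}"    # salt spreads writes
    return entity_id

def read_keys(entity_id: str) -> list[str]:
    if entity_id in HOT_KEYS:
        return [f"{entity_id}#{i}" for i in range(FANOUT)]  # gather on read
    return [entity_id]

print(write_key("celebrity-1"))   # e.g. "celebrity-1#3"
print(read_keys("celebrity-1"))   # all 8 sub-partitions must be queried
```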
Quota Management:
- Rate limit high-volume entities
- Prevent single entity overwhelming system
- Business decision on fairness
- May require separate tier for high-volume
Separate Paths:
- Dedicated infrastructure for high-volume entities
- Isolates impact
- Higher complexity
- Economical for few high-volume entities
Processing Architecture Patterns
Parallel Processing with Exactly-Once Semantics
At scale, failures and retries are continuous, so events will inevitably be delivered more than once; processing must be built to tolerate duplicates.
Idempotency Patterns:
Unique Event IDs:
- Producer assigns unique ID
- Deduplicate based on ID
- Store processed IDs
- Bloom filter for quick check
Conditional Writes:
- Database supports conditional updates
- Check before modifying
- Atomic test-and-set
- Prevents duplicate effects
Versioning:
- Every state change has version number
- Only apply if version matches
- Prevents out-of-order updates
- Enables conflict detection
State Store for Deduplication:
- Track processed event IDs
- TTL for old entries
- Distributed cache (Redis, Memcached)
- Balance dedup-store size against the cost of reprocessing duplicates
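A minimal deduplication sketch based on producer-assigned IDs and a TTL window; the in-memory store and one-hour TTL are assumptions, and a production system would typically keep this state in Redis (with key expiry) fronted by a Bloom filter:

```python
# Deduplicate by producer-assigned event ID, with a TTL so the store stays bounded.
# In-memory dict for illustration; production would use Redis or similar.
import time

class Deduplicator:
    def __init__(self, ttl_seconds: float = 3600.0):   # TTL is an assumed window
        self._ttl = ttl_seconds
        self._seen: dict[str, float] = {}              # event_id -> expiry time

    def should_process(self, event_id: str) -> bool:
        now = time.time()
        # Lazily drop expired entries to bound memory (O(n) here; a real store
        # would expire entries itself).
        self._seen = {eid: exp for eid, exp in self._seen.items() if exp > now}
        if event_id in self._seen:
            return False                               # duplicate: skip side effects
        self._seen[event_id] = now + self._ttl
        return True

dedup = Deduplicator()
print(dedup.should_process("evt-1"))   # True: first delivery
print(dedup.should_process("evt-1"))   # False: redelivered duplicate
```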
Backpressure and Flow Control
Preventing system overload:
Queue-Based Backpressure:
- Fixed queue size limits
- Producers block when queue full
- Provides natural pushback
- May cause producer timeouts
Rate Limiting:
- Limit ingestion rate
- Smooth traffic spikes
- May reject events during peaks
- Requires retry logic
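A token bucket is a common way to implement the rate limit; the sustained rate and burst size below are illustrative:

```python
# Token-bucket rate limiter: allows a sustained rate with a bounded burst.
# Rate and burst values are illustrative assumptions.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False            # caller retries later or sheds the event

limiter = TokenBucket(rate_per_sec=2000, burst=500)
accepted = sum(limiter.allow() for _ in range(1000))
print(f"accepted {accepted} of 1000 in a burst")   # roughly the burst size
```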
Dynamic Scaling:
- Add processing capacity under load
- Auto-scaling based on queue depth
- Takes time to spin up (minutes)
- Cost-effective but not instant
Graceful Degradation:
- Sample events during overload
- Process subset, drop rest
- Maintain system stability
- Accept partial data loss
The architectural choice depends on whether you can afford to drop events or must process all.
Batch Processing Optimization
Processing efficiency improves with batching:
Micro-Batching:
- Accumulate events for short period (100ms-1s)
- Process batch together
- Amortize per-event overhead
- Balance latency and throughput
Batch Size Tuning:
- Too small: high overhead
- Too large: high latency, memory pressure
- Depends on event size and processing
- Typically 100-1000 events per batch
Batch Writing:
- Single write for batch vs per-event writes
- 10-100x performance improvement
- Requires transaction support or accepts partial failure
- Critical for database writes
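A micro-batcher that flushes on either a size or an age threshold captures most of this; the thresholds and the flush callback are assumptions, and a production version would also flush from a background timer rather than only on event arrival:

```python
# Micro-batcher: flush when the batch reaches max_size or max_age_seconds,
# whichever comes first. Thresholds and the flush callback are illustrative.
import time

class MicroBatcher:
    def __init__(self, flush_fn, max_size=500, max_age_seconds=0.5):
        self._flush_fn = flush_fn
        self._max_size = max_size
        self._max_age = max_age_seconds
        self._batch = []
        self._first_ts = None

    def add(self, event):
        if not self._batch:
            self._first_ts = time.monotonic()
        self._batch.append(event)
        if (len(self._batch) >= self._max_size
                or time.monotonic() - self._first_ts >= self._max_age):
            self.flush()

    def flush(self):
        if self._batch:
            self._flush_fn(self._batch)     # one bulk write instead of N writes
            self._batch = []

batcher = MicroBatcher(lambda batch: print(f"wrote {len(batch)} events"), max_size=3)
for i in range(7):
    batcher.add({"id": i})
batcher.flush()                             # flush the tail on shutdown
```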
Aggregation and Real-Time Analytics
Pre-Aggregation Patterns
Computing aggregates incrementally:
Streaming Aggregation:
- Update aggregates as events arrive
- Maintain state for active windows
- Emit results periodically
- Windowing strategies (tumbling, sliding, session)
Lambda Architecture for Aggregates:
- Real-time: approximate aggregates from stream
- Batch: accurate aggregates from historical data
- Merge at query time
- Balance freshness and accuracy
Materialized Views:
- Incrementally maintain aggregate tables
- Serve from pre-computed results
- Background recomputation for accuracy
- Trade storage for query speed
Approximate Aggregations:
- Sketches and probabilistic structures
- Count-min sketch for frequency
- HyperLogLog for cardinality
- Significant space savings, bounded error
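As one concrete example, a count-min sketch estimates per-key frequencies in fixed memory with a bounded, one-sided error; the width and depth below are assumed values:

```python
# Count-min sketch: fixed-memory frequency estimates with one-sided (over)estimation.
# Width and depth control the error bound and are assumed values here.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key: str, count: int = 1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(1000):
    cms.add("page:/home")
cms.add("page:/rare")
print(cms.estimate("page:/home"), cms.estimate("page:/rare"))   # ~1000, >=1
```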
At 100M events/day, pre-aggregation is essential; aggregates cannot be computed on demand from raw events at query time.
Time Window Management
Tumbling Windows:
- Fixed, non-overlapping time periods
- Each event in exactly one window
- Simple to implement
- Arbitrary boundaries (events at 11:59:59.999 and 12:00:00.001 land in different windows)
Sliding Windows:
- Overlapping time periods
- Each event in multiple windows
- Smoother transitions
- Higher computation cost
Session Windows:
- Gap-based grouping
- Window ends after inactivity period
- Variable window sizes
- Useful for user sessions
Watermarking:
- Handle late-arriving events
- Determine when window is complete
- Trade-off between latency and completeness
- Critical for correctness
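A tumbling-window counter with a watermark shows how window assignment, late events, and window closing interact; the window size and allowed lateness are assumptions:

```python
# Tumbling-window counts with a watermark: a window is emitted only once the
# watermark (max event time seen minus allowed lateness) passes its end.
# Window size and lateness bound are assumed values.
from collections import defaultdict

WINDOW_SECONDS = 60
ALLOWED_LATENESS = 30

windows = defaultdict(int)        # window_start -> count
emitted = {}
max_event_time = 0.0

def on_event(event_time: float):
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    window_start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    if window_start + WINDOW_SECONDS <= watermark:
        return                    # too late: window already closed (route to DLQ/corrections)
    windows[window_start] += 1
    # Close and emit every window whose end is now behind the watermark.
    for start in sorted(windows):
        if start + WINDOW_SECONDS <= watermark:
            emitted[start] = windows.pop(start)

for t in (0, 10, 61, 65, 5, 130):     # the event at t=5 arrives late but within bounds
    on_event(t)
print("emitted:", emitted, "open:", dict(windows))
```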
Fault Tolerance and Reliability
Failure Modes at Scale
With thousands of events per second, failures are constant:
Expected Failure Rates:
- Infrastructure: 0.1% of requests fail
- Network: transient errors common
- Dependencies: occasional unavailability
- At 1000/sec, that’s 1 failure/second
Design for Failure:
- Retries with exponential backoff
- Circuit breakers for failing dependencies
- Fallbacks and degraded modes
- Isolation to prevent cascade failures
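Retries with exponential backoff and full jitter are the workhorse here; the attempt count and delay bounds below are assumptions, and only idempotent operations should be retried this way:

```python
# Retry with exponential backoff and full jitter; caps and attempt count are
# illustrative. Only retry operations that are safe to repeat (idempotent).
import random
import time

def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                              # give up: surface to DLQ/alerting
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))   # full jitter avoids thundering herds

# Usage with a flaky stand-in dependency:
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry(flaky_write))   # succeeds on the third attempt
```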
Checkpointing and Recovery
Periodic Checkpointing:
- Save processing state regularly
- Balance checkpoint frequency vs recovery time
- Asynchronous checkpointing to avoid blocking
- Checkpoint storage durability critical
Event Log Replay:
- Reprocess from event log after failure
- Requires idempotent processing
- Can take hours for large backlogs
- Combine with checkpoints for faster recovery
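A minimal checkpoint-and-replay loop over an event log; the atomic write-then-rename is the important detail, while the file layout and checkpoint interval are assumptions:

```python
# Periodic checkpointing of (offset, state) with an atomic replace, plus replay of
# the event log from the last checkpoint on restart. The file layout is illustrative.
import json
import os

CHECKPOINT_PATH = "checkpoint.json"

def load_checkpoint():
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"offset": 0, "state": {"count": 0}}

def save_checkpoint(offset, state):
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset, "state": state}, f)
    os.replace(tmp, CHECKPOINT_PATH)          # atomic: never a half-written checkpoint

def process(log, checkpoint_every=1000):
    cp = load_checkpoint()
    state = cp["state"]
    for offset in range(cp["offset"], len(log)):
        state["count"] += 1                   # the real work; must be idempotent, since
                                              # replay repeats events after the checkpoint
        if (offset + 1) % checkpoint_every == 0:
            save_checkpoint(offset + 1, state)
    save_checkpoint(len(log), state)
    return state

print(process([{"id": i} for i in range(2500)]))   # resumes from the checkpoint if rerun
```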
Multi-Version Concurrency:
- Keep multiple versions of state
- Roll back to previous version on corruption
- Higher storage cost
- Enables fast recovery
Monitoring and Alerting at Scale
Key Metrics:
Throughput Metrics:
- Events per second (p50, p95, p99)
- Bytes per second
- Batch size distribution
- Trends over time
Latency Metrics:
- End-to-end latency
- Per-stage latency
- Percentiles (p95, p99, p99.9)
- Latency budgets
Error Metrics:
- Error rate by type
- Retry rate
- Dead letter queue size
- Downstream error propagation
Resource Metrics:
- CPU utilization
- Memory usage
- Network throughput
- Disk I/O
- Queue depths
Alert Thresholds:
- Anomaly-based rather than static
- Account for daily/weekly patterns
- Multiple severity levels
- Alert fatigue prevention
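A simple form of anomaly-based alerting compares the current value against a rolling mean and standard deviation instead of a fixed line; the window length and sigma multiplier are assumptions:

```python
# Anomaly-based alerting: flag values that deviate from a rolling baseline
# instead of crossing a static threshold. Window and sigma are assumed values.
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    def __init__(self, window=60, sigmas=4.0):
        self.history = deque(maxlen=window)
        self.sigmas = sigmas

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 10:                    # need a baseline first
            mu, sd = mean(self.history), stdev(self.history)
            anomalous = sd > 0 and abs(value - mu) > self.sigmas * sd
        self.history.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for eps in [1100, 1150, 1120, 1180, 1140, 1160, 1130, 1170, 1110, 1150, 9000]:
    if detector.is_anomalous(eps):
        print(f"alert: events/sec {eps} deviates from the rolling baseline")
```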
Performance Optimization Patterns
Hot Path Optimization
The ingestion path runs for every event, so small per-event costs multiply across millions of events:
Minimize Allocations:
- Object pooling for frequent allocations
- Reduce garbage collection pressure
- Pre-allocate buffers
- Zero-copy techniques where possible
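A buffer pool is the simplest version of this; it is sketched in Python for consistency with the other examples, though the technique matters most in GC-heavy or native hot paths, and the sizes are assumptions:

```python
# Simple buffer pool: reuse pre-allocated bytearrays instead of allocating one
# per event, reducing allocation and GC pressure on the hot path. Sizes are assumed.
from collections import deque

class BufferPool:
    def __init__(self, buffer_size=64 * 1024, max_pooled=256):
        self._size = buffer_size
        self._pool = deque(maxlen=max_pooled)

    def acquire(self) -> bytearray:
        return self._pool.pop() if self._pool else bytearray(self._size)

    def release(self, buf: bytearray):
        self._pool.append(buf)      # oldest buffers fall off when the pool is full

pool = BufferPool()
buf = pool.acquire()                # reused across events instead of reallocated
buf[:5] = b"hello"
pool.release(buf)
```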
Avoid Synchronous I/O:
- Async I/O for network and disk
- Non-blocking operations
- Callbacks or futures
- Event-driven architecture
Batch Database Operations:
- Bulk inserts instead of individual
- Prepared statements
- Connection pooling
- Reduce round-trips
Caching Frequently Accessed Data:
- Reference data in memory
- Configuration cached
- Lookups memoized
- Cache invalidation strategy
Network Optimization
The network is often the bottleneck at scale:
Protocol Selection:
- Binary protocols (Protobuf, Avro) over JSON
- Compression for large payloads
- HTTP/2 for multiplexing
- gRPC for service-to-service
Connection Management:
- Connection pooling and reuse
- Keep-alive connections
- Limit connection overhead
- Balance pool size vs resource usage
Payload Optimization:
- Remove unnecessary fields
- Compact representations
- Compression (gzip, zstd)
- Delta encoding for updates
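The payload savings are easy to measure; zlib is used below because it ships with the standard library (zstd typically offers a better speed/ratio trade-off), and the synthetic payload is an assumption:

```python
# Measure the effect of compressing a batch of JSON events before sending.
# zlib is used because it ships with Python; the payload is synthetic.
import json
import zlib

events = [{"id": i, "type": "page_view", "url": "/products/123", "ts": 1700000000 + i}
          for i in range(1000)]
raw = json.dumps(events).encode()
compressed = zlib.compress(raw, level=6)

print(f"raw: {len(raw):,} bytes, compressed: {len(compressed):,} bytes, "
      f"ratio: {len(raw) / len(compressed):.1f}x")
```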
Database Optimization
Database writes are often the bottleneck:
Write Optimization:
- Batch writes (100-1000 records)
- Use bulk insert APIs
- Disable/defer index updates during bulk insert
- Partition for parallel writes
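The batching effect is visible even with an embedded database; sqlite3 appears below purely for illustration, and a production system would use its database's bulk-load path (for example COPY in PostgreSQL):

```python
# Batch insert via one transaction and one prepared statement, using sqlite3
# purely for illustration; production systems would use their database's bulk API.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, type TEXT, ts INTEGER)")

rows = [(i, "page_view", 1700000000 + i) for i in range(10_000)]

with conn:                                   # single transaction for the whole batch
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])   # 10000
```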
Index Strategy:
- Minimal indexes on high-write tables
- Covering indexes for frequent queries
- Defer index creation to batch process
- Partial indexes for subset queries
Connection Pooling:
- Reuse database connections
- Size pool for concurrency
- Monitor connection usage
- Avoid connection exhaustion
Cost Optimization at Scale
At 100M events/day, costs matter:
Infrastructure Right-Sizing:
- Profile actual resource usage
- Avoid over-provisioning
- Use auto-scaling for variable load
- Reserved capacity for baseline
Tiered Storage:
- Hot data (recent) on fast storage
- Warm data on standard storage
- Cold data (archive) on cheap storage
- Automatic lifecycle management
Compression:
- Compress stored events
- 5-10x size reduction typical
- CPU cost vs storage savings
- Especially valuable for long retention
Sampling and Aggregation:
- Store raw events for limited period
- Store aggregates for longer
- Reduces storage costs 10-100x
- Accept loss of granularity for old data
Conclusion
Building systems that process 100M+ events per day requires architectural patterns specifically designed for scale:
- Multi-stage ingestion with buffering to handle spikes and decouple stages
- Write-optimized storage with appropriate partitioning for throughput
- Idempotent processing to handle inevitable failures and duplicates
- Pre-aggregation for real-time analytics, since aggregates cannot be computed on demand
- Comprehensive monitoring to understand behavior at scale
- Cost optimization as infrastructure costs become significant
The key insight: at this scale, you cannot reason about individual events. The architecture must handle events in aggregate, optimize for overall throughput rather than per-event latency, and assume continuous partial failures.
Start with these patterns from the beginning. Retrofitting them into a system designed for lower scale is far more difficult than building them in from the start. The architecture that works at 1M events/day will not work at 100M. Plan for scale even if you’re not there yet.