Processing 100 million events per day means handling over 1,000 events per second continuously, with peaks potentially 10x higher. At this scale, architectural decisions that seem minor at lower volumes become critical. After building and operating systems at this scale, I’ve learned that the architecture must be designed for failure, optimized for the common case, and instrumented for understanding behavior under load.

Scale Characteristics and Constraints

Understanding the Numbers

100 million events per day translates to:

  • Average: 1,157 events/second
  • Peak: 5,000-10,000 events/second (assuming 5-10x peak factor)
  • Daily data volume: 100GB-1TB depending on event size
  • Monthly data: 3-30TB
  • Annual retention: 36-360TB

These numbers drive architectural decisions:

  • Database writes: thousands per second
  • Network throughput: hundreds of megabits per second sustained, gigabits at peak once replication and fan-out are included
  • Storage: terabytes to petabytes
  • Processing: hundreds to thousands of CPU cores, depending on enrichment cost
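
A quick back-of-envelope sketch of how these figures fall out; the per-event size and peak factor are assumptions you should replace with your own measurements.

```python
# Back-of-envelope sizing for 100M events/day.
# Event size and peak factor are assumptions; substitute measured values.
EVENTS_PER_DAY = 100_000_000
SECONDS_PER_DAY = 86_400
AVG_EVENT_BYTES = 1_024        # assumed 1 KB per event
PEAK_FACTOR = 8                # assumed peak-to-average ratio

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY
peak_eps = avg_eps * PEAK_FACTOR
daily_bytes = EVENTS_PER_DAY * AVG_EVENT_BYTES

print(f"average: {avg_eps:,.0f} events/s")
print(f"peak:    {peak_eps:,.0f} events/s")
print(f"daily:   {daily_bytes / 1e9:,.0f} GB/day")
print(f"monthly: {daily_bytes * 30 / 1e12:,.1f} TB")
print(f"yearly:  {daily_bytes * 365 / 1e12:,.1f} TB")
```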

System Constraints at Scale

Write Throughput Dominance:

  • Reads are typically less than 10% of writes
  • Cannot rely on caching for writes
  • Write path is the critical path
  • Optimization focuses on ingestion

Partitioning Required:

  • Single-node systems cannot handle this load
  • Horizontal partitioning essential
  • Careful partition key selection critical
  • Rebalancing strategies needed

Eventual Consistency Acceptable:

  • Strong consistency too expensive at scale
  • Eventual consistency with bounded lag
  • Design for idempotency
  • Compensating transactions for corrections

Cost Constraints:

  • Infrastructure costs become significant
  • Efficiency directly impacts profitability
  • Resource optimization crucial
  • Monitoring costs themselves become non-trivial

Ingestion Architecture Patterns

Multi-Stage Ingestion Pipeline

High-throughput ingestion requires staged processing:

Stage 1: Edge Collection:

  • Accept events from producers
  • Minimal validation (malformed requests rejected)
  • Immediate acknowledgment
  • Load balancing across collectors
  • Geographic distribution for latency

Stage 2: Buffering and Batching:

  • Queue or stream buffer (Kafka, Kinesis)
  • Absorbs traffic spikes
  • Enables batching for efficiency
  • Provides durability guarantee
  • Decouples collection from processing

Stage 3: Validation and Enrichment:

  • Schema validation
  • Business rule checking
  • Data enrichment (lookups, transformations)
  • Dead letter queue for invalid events
  • Parallel processing for throughput

Stage 4: Storage and Indexing:

  • Batch writes to storage
  • Update indexes
  • Update aggregates
  • Replicate to read stores
  • Archive to cold storage

Architectural Benefits:

  • Each stage optimized independently
  • Failure in one stage doesn’t cascade
  • Backpressure managed through queues
  • Easy to add processing stages

Trade-offs:

  • Higher complexity
  • End-to-end latency increased
  • More infrastructure components
  • Operational overhead
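
Despite the trade-offs, the staged shape is straightforward to sketch. Below is a minimal, single-process illustration of collect → buffer → validate/enrich → batch write, using an in-memory queue to stand in for Kafka or Kinesis; the names, sizes, and flush thresholds are illustrative, not a production design.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=10_000)   # stands in for Kafka/Kinesis

def collect(event: dict) -> dict:
    """Stage 1: accept, minimally validate, acknowledge immediately."""
    if "id" not in event:
        raise ValueError("malformed event")
    buffer.put(event)                  # Stage 2: durable buffer in a real system
    return {"status": "accepted"}

def process_loop(batch_size=500, flush_interval=1.0):
    """Stages 3-4: validate/enrich, then batch-write to storage."""
    batch, last_flush = [], time.monotonic()
    while True:
        try:
            event = buffer.get(timeout=0.1)
            event["enriched_at"] = time.time()     # enrichment placeholder
            batch.append(event)
        except queue.Empty:
            pass
        if len(batch) >= batch_size or time.monotonic() - last_flush > flush_interval:
            if batch:
                write_batch(batch)                 # bulk insert in a real system
                batch = []
            last_flush = time.monotonic()

def write_batch(batch):
    print(f"wrote {len(batch)} events")

threading.Thread(target=process_loop, daemon=True).start()
```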

Load Balancing Strategies

Distributing load across ingestion nodes:

Geographic Load Balancing:

  • Route to nearest data center
  • Reduces latency for producers
  • Regional failure isolation
  • Compliance with data residency

Consistent Hashing:

  • Deterministic routing
  • Minimal disruption on node changes
  • Enables local caching
  • Uneven distribution with skewed keys

Least Connections:

  • Route to least busy node
  • Better balance with variable request costs
  • Requires connection state tracking
  • More complex than round-robin

Weighted Round-Robin:

  • Account for heterogeneous nodes
  • Simple and predictable
  • Static weights may not adapt to load
  • Good for homogeneous infrastructure

The choice depends on client distribution, event size variation, and infrastructure homogeneity.
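
As an example of the consistent-hashing option, here is a small hash-ring sketch that uses virtual nodes to smooth out distribution; the collector names and vnode count are illustrative assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing with virtual nodes for more even key distribution."""
    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["collector-1", "collector-2", "collector-3"])
print(ring.route("user-42"))   # the same key always routes to the same collector
```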

Storage Architecture for High Write Throughput

Write-Optimized Storage Patterns

Log-Structured Merge (LSM) Trees:

  • Sequential writes to log
  • Background compaction
  • Excellent write throughput
  • Read amplification trade-off
  • Used by Cassandra, RocksDB, HBase

Append-Only Logs:

  • Pure sequential writes
  • No updates or deletes
  • Compaction for cleanup
  • Excellent write performance
  • Limited query capabilities

Time-Series Databases:

  • Optimized for time-ordered writes
  • Downsampling for retention
  • Efficient range queries
  • Limited secondary indexes
  • Examples: InfluxDB, TimescaleDB, ClickHouse

Partitioned Relational Databases:

  • Time-based or key-based partitioning
  • Partition pruning for queries
  • Parallel writes to partitions
  • More complex management
  • Examples: PostgreSQL with partitioning, MySQL sharding

Partitioning Strategies

Time-Based Partitioning:

  • One partition per time period (hour, day)
  • Efficient for time-range queries
  • Easy to archive old partitions
  • May have hot partition for current time
  • Predictable growth pattern

Hash-Based Partitioning:

  • Hash entity ID to partition
  • Even distribution
  • Enables horizontal scaling
  • Difficult to rebalance
  • Range queries require all partitions

Range-Based Partitioning:

  • Partition by entity ID ranges
  • Enables range queries
  • Uneven distribution with skewed data
  • Easier rebalancing than hash
  • Requires careful boundary selection

Composite Partitioning:

  • Combine strategies (time + hash)
  • Balance benefits of multiple approaches
  • Time-based primary, hash-based secondary
  • More complex but powerful
  • Common at large scale

For event systems, composite time + entity hash partitioning often works best.
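
A sketch of deriving such a composite key, with an hourly time bucket plus an entity-hash suffix; the bucket granularity and sub-partition count are assumptions to tune for your workload.

```python
import hashlib
from datetime import datetime, timezone

NUM_HASH_PARTITIONS = 16   # assumed sub-partitions per time bucket

def partition_key(entity_id: str, event_time: datetime) -> str:
    """Composite key: hourly time bucket + hash of the entity id.

    Time-range queries prune to a few hourly buckets, while the hash
    suffix spreads the current hour's writes across NUM_HASH_PARTITIONS.
    """
    bucket = event_time.astimezone(timezone.utc).strftime("%Y%m%d%H")
    shard = int(hashlib.sha1(entity_id.encode()).hexdigest(), 16) % NUM_HASH_PARTITIONS
    return f"{bucket}-{shard:02d}"

print(partition_key("user-42", datetime.now(timezone.utc)))  # format: "YYYYMMDDHH-NN"
```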

Hot Partition Problem

Skewed write distribution to partitions:

Causes:

  • Popular entities (celebrities, trending topics)
  • Time-based patterns (start of hour)
  • Key hashing collisions
  • Poor partition key choice

Mitigation Strategies:

Partition Splitting:

  • Split hot partitions into multiple
  • Redistribute keys
  • Requires data migration
  • Temporary disruption

Sub-Partitioning:

  • Add random suffix to hot keys
  • Spread across multiple partitions
  • Requires gathering on read
  • Effective for write-heavy, read-light
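
A minimal sketch of this key-salting approach: writes to a known-hot key get a random suffix, and reads gather across all salts. The salt count and hot-key set are illustrative assumptions.

```python
import random

SALT_COUNT = 8                      # assumed number of sub-partitions per hot key
HOT_KEYS = {"celebrity-123"}        # identified hot entities (illustrative)

def salted_write_key(key: str) -> str:
    """Spread writes for hot keys across SALT_COUNT sub-partitions."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SALT_COUNT)}"
    return key

def gather_read_keys(key: str) -> list[str]:
    """Reads must gather all salted variants of a hot key."""
    if key in HOT_KEYS:
        return [f"{key}#{i}" for i in range(SALT_COUNT)]
    return [key]
```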

Quota Management:

  • Rate limit high-volume entities
  • Prevent single entity overwhelming system
  • Business decision on fairness
  • May require separate tier for high-volume

Separate Paths:

  • Dedicated infrastructure for high-volume entities
  • Isolates impact
  • Higher complexity
  • Economical for few high-volume entities

Processing Architecture Patterns

Parallel Processing with Exactly-Once Semantics

At scale, failures and retries are continuous, so events will be delivered more than once. Processing must be idempotent so that duplicate deliveries produce exactly-once effects.

Idempotency Patterns:

Unique Event IDs:

  • Producer assigns unique ID
  • Deduplicate based on ID
  • Store processed IDs
  • Bloom filter for quick check

Conditional Writes:

  • Database supports conditional updates
  • Check before modifying
  • Atomic test-and-set
  • Prevents duplicate effects

Versioning:

  • Every state change has version number
  • Only apply if version matches
  • Prevents out-of-order updates
  • Enables conflict detection

State Store for Deduplication:

  • Track processed event IDs
  • TTL for old entries
  • Distributed cache (Redis, Memcached)
  • Balance storage vs processing duplication
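
A sketch of state-store deduplication using Redis SET with NX and a TTL as an atomic per-event claim; the TTL and key prefix are assumptions, and the handler must still tolerate a crash between the claim and its own completion.

```python
import redis

r = redis.Redis()                     # assumes a reachable Redis instance
DEDUP_TTL_SECONDS = 6 * 3600          # assumed: longer than the max redelivery delay

def process_once(event: dict) -> bool:
    """Claim the event id atomically; skip if another worker already did."""
    claimed = r.set(f"dedup:{event['id']}", 1, nx=True, ex=DEDUP_TTL_SECONDS)
    if not claimed:
        return False                  # duplicate delivery, drop it
    handle(event)                     # should itself be safe to re-run after a crash
    return True

def handle(event: dict) -> None:
    print("processing", event["id"])
```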

Backpressure and Flow Control

Preventing system overload:

Queue-Based Backpressure:

  • Fixed queue size limits
  • Producers block when queue full
  • Provides natural pushback
  • May cause producer timeouts

Rate Limiting:

  • Limit ingestion rate
  • Smooth traffic spikes
  • May reject events during peaks
  • Requires retry logic

Dynamic Scaling:

  • Add processing capacity under load
  • Auto-scaling based on queue depth
  • Takes time to spin up (minutes)
  • Cost-effective but not instant

Graceful Degradation:

  • Sample events during overload
  • Process subset, drop rest
  • Maintain system stability
  • Accept partial data loss

The architectural choice depends on whether you can afford to drop events or must process all.
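
For the case where partial loss is acceptable, the sketch below combines a bounded queue (briefly blocking producers for natural backpressure) with sampling-based degradation when the queue stays full; the queue size, timeout, and sample rate are assumptions.

```python
import queue
import random

ingest_queue = queue.Queue(maxsize=50_000)   # bounded: a full queue pushes back
OVERLOAD_SAMPLE_RATE = 0.1                   # assumed: keep ~10% of events when overloaded

def submit(event) -> bool:
    """Try to enqueue; under sustained overload, fall back to sampling."""
    try:
        ingest_queue.put(event, timeout=0.05)   # brief blocking = natural backpressure
        return True
    except queue.Full:
        if random.random() < OVERLOAD_SAMPLE_RATE:
            try:
                ingest_queue.put_nowait(event)  # best-effort for the sampled subset
                return True
            except queue.Full:
                pass
        return False                            # shed load; count drops in metrics
```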

Batch Processing Optimization

Processing efficiency improves with batching:

Micro-Batching:

  • Accumulate events for short period (100ms-1s)
  • Process batch together
  • Amortize per-event overhead
  • Balance latency and throughput

Batch Size Tuning:

  • Too small: high overhead
  • Too large: high latency, memory pressure
  • Depends on event size and processing
  • Typically 100-1000 events per batch

Batch Writing:

  • Single write for batch vs per-event writes
  • 10-100x performance improvement
  • Requires transaction support or accepts partial failure
  • Critical for database writes
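
A sketch of a micro-batching writer that flushes on either batch size or batch age, whichever comes first; the thresholds and the flush callback are illustrative.

```python
import time

class MicroBatcher:
    """Accumulate events and flush on size or age, whichever comes first."""
    def __init__(self, flush_fn, max_size=500, max_age_seconds=0.5):
        self.flush_fn = flush_fn
        self.max_size = max_size
        self.max_age = max_age_seconds
        self.batch = []
        self.opened_at = time.monotonic()

    def add(self, event):
        self.batch.append(event)
        if (len(self.batch) >= self.max_size
                or time.monotonic() - self.opened_at >= self.max_age):
            self.flush()

    def flush(self):
        if self.batch:
            self.flush_fn(self.batch)        # one bulk write instead of N small ones
        self.batch = []
        self.opened_at = time.monotonic()

batcher = MicroBatcher(lambda b: print(f"flushed {len(b)} events"))
```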

Aggregation and Real-Time Analytics

Pre-Aggregation Patterns

Computing aggregates incrementally:

Streaming Aggregation:

  • Update aggregates as events arrive
  • Maintain state for active windows
  • Emit results periodically
  • Windowing strategies (tumbling, sliding, session)

Lambda Architecture for Aggregates:

  • Real-time: approximate aggregates from stream
  • Batch: accurate aggregates from historical data
  • Merge at query time
  • Balance freshness and accuracy

Materialized Views:

  • Incrementally maintain aggregate tables
  • Serve from pre-computed results
  • Background recomputation for accuracy
  • Trade storage for query speed

Approximate Aggregations:

  • Sketches and probabilistic structures
  • Count-min sketch for frequency
  • HyperLogLog for cardinality
  • Significant space savings, bounded error

At 100M events/day, pre-aggregation is essential; aggregates cannot be computed on demand at query time.
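
As an example of the approximate approach, here is a minimal count-min sketch for per-key frequency estimates; the width and depth control the error bound and are assumptions here.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in fixed memory (never under-estimates)."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str):
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key: str, count: int = 1):
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][col] for row, col in self._indexes(key))

cms = CountMinSketch()
cms.add("page:/home")
print(cms.estimate("page:/home"))   # >= true count, usually equal
```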

Time Window Management

Tumbling Windows:

  • Fixed, non-overlapping time periods
  • Each event in exactly one window
  • Simple to implement
  • Arbitrary boundaries split adjacent events (12:00:00.001 and 11:59:59.999 land in different windows)

Sliding Windows:

  • Overlapping time periods
  • Each event in multiple windows
  • Smoother transitions
  • Higher computation cost

Session Windows:

  • Gap-based grouping
  • Window ends after inactivity period
  • Variable window sizes
  • Useful for user sessions

Watermarking:

  • Handle late-arriving events
  • Determine when window is complete
  • Trade-off between latency and completeness
  • Critical for correctness
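
A sketch of tumbling-window counting with a simple watermark: a window is emitted once the watermark (maximum observed event time minus an allowed lateness) passes its end. The window size and lateness are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60          # assumed tumbling window size
ALLOWED_LATENESS = 30        # assumed maximum lateness before a window closes

class TumblingCounter:
    def __init__(self):
        self.windows = defaultdict(int)   # window start -> event count
        self.max_event_time = 0.0

    def add(self, event_time: float) -> dict:
        start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        self.windows[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        return self._emit_closed()

    def _emit_closed(self) -> dict:
        """Emit windows whose end precedes the watermark; later events are dropped."""
        watermark = self.max_event_time - ALLOWED_LATENESS
        closed = [w for w in self.windows if w + WINDOW_SECONDS <= watermark]
        return {w: self.windows.pop(w) for w in sorted(closed)}
```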

Fault Tolerance and Reliability

Failure Modes at Scale

With thousands of events per second, failures are constant:

Expected Failure Rates:

  • Infrastructure: 0.1% of requests fail
  • Network: transient errors common
  • Dependencies: occasional unavailability
  • At 1,000 events/sec and a 0.1% failure rate, that is roughly one failure every second

Design for Failure:

  • Retries with exponential backoff
  • Circuit breakers for failing dependencies
  • Fallbacks and degraded modes
  • Isolation to prevent cascade failures
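
A sketch of the retry pattern with exponential backoff and full jitter; the base delay, cap, and attempt count are assumptions, and a real system would only retry errors it knows to be transient.

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a failing call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                # exhausted: surface to caller / DLQ
            backoff = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))   # full jitter avoids retry storms
```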

Checkpointing and Recovery

Periodic Checkpointing:

  • Save processing state regularly
  • Balance checkpoint frequency vs recovery time
  • Asynchronous checkpointing to avoid blocking
  • Checkpoint storage durability critical

Event Log Replay:

  • Reprocess from event log after failure
  • Requires idempotent processing
  • Can take hours for large backlogs
  • Combine with checkpoints for faster recovery
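
A sketch of how checkpointing and replay fit together: offsets are checkpointed periodically with an atomic file swap, and after a crash the consumer replays from the last checkpoint, which is why processing must be idempotent. The checkpoint path and interval are assumptions.

```python
import json
import os

CHECKPOINT_PATH = "consumer.checkpoint"   # assumed location
CHECKPOINT_EVERY = 1_000                  # assumed: checkpoint every N events

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["offset"]
    return 0

def save_checkpoint(offset: int) -> None:
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, CHECKPOINT_PATH)       # atomic swap so a crash never corrupts it

def consume(log, process):
    """Replay from the last checkpoint; events after it are re-processed
    after a crash, so process() must be idempotent."""
    start = load_checkpoint()
    for offset, event in enumerate(log[start:], start=start):
        process(event)
        if (offset + 1) % CHECKPOINT_EVERY == 0:
            save_checkpoint(offset + 1)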

Multi-Version Concurrency:

  • Keep multiple versions of state
  • Roll back to previous version on corruption
  • Higher storage cost
  • Enables fast recovery

Monitoring and Alerting at Scale

Key Metrics:

Throughput Metrics:

  • Events per second (p50, p95, p99)
  • Bytes per second
  • Batch size distribution
  • Trends over time

Latency Metrics:

  • End-to-end latency
  • Per-stage latency
  • Percentiles (p95, p99, p99.9)
  • Latency budgets

Error Metrics:

  • Error rate by type
  • Retry rate
  • Dead letter queue size
  • Downstream error propagation

Resource Metrics:

  • CPU utilization
  • Memory usage
  • Network throughput
  • Disk I/O
  • Queue depths

Alert Thresholds:

  • Anomaly-based rather than static
  • Account for daily/weekly patterns
  • Multiple severity levels
  • Alert fatigue prevention

Performance Optimization Patterns

Hot Path Optimization

The ingestion path is critical:

Minimize Allocations:

  • Object pooling for frequent allocations
  • Reduce garbage collection pressure
  • Pre-allocate buffers
  • Zero-copy techniques where possible

Avoid Synchronous I/O:

  • Async I/O for network and disk
  • Non-blocking operations
  • Callbacks or futures
  • Event-driven architecture

Batch Database Operations:

  • Bulk inserts instead of individual
  • Prepared statements
  • Connection pooling
  • Reduce round-trips
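
A sketch of batched inserts with executemany over a single connection, using sqlite3 only so the example is self-contained; a production system would use its database's bulk-load path (e.g. COPY or a bulk insert API) over a pooled connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, ts REAL, payload TEXT)")

def write_batch(events):
    """One transaction and one statement execution for the whole batch."""
    with conn:   # commits once for the batch instead of per row
        conn.executemany(
            "INSERT OR IGNORE INTO events (id, ts, payload) VALUES (?, ?, ?)",
            [(e["id"], e["ts"], e["payload"]) for e in events],
        )

write_batch([{"id": "e1", "ts": 1.0, "payload": "{}"},
             {"id": "e2", "ts": 2.0, "payload": "{}"}])
```

The `INSERT OR IGNORE` also makes the batch write idempotent on the primary key, which ties back to the duplicate-handling discussion above.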

Caching Frequently Accessed Data:

  • Reference data in memory
  • Configuration cached
  • Lookups memoized
  • Cache invalidation strategy

Network Optimization

The network is often the bottleneck at scale:

Protocol Selection:

  • Binary protocols (Protobuf, Avro) over JSON
  • Compression for large payloads
  • HTTP/2 for multiplexing
  • gRPC for service-to-service

Connection Management:

  • Connection pooling and reuse
  • Keep-alive connections
  • Limit connection overhead
  • Balance pool size vs resource usage

Payload Optimization:

  • Remove unnecessary fields
  • Compact representations
  • Compression (gzip, zstd)
  • Delta encoding for updates
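
A small sketch comparing raw JSON with a gzip-compressed payload, using only the standard library; real payloads and ratios will differ, and zstd typically compresses faster at similar ratios.

```python
import gzip
import json

events = [{"id": f"evt-{i}", "type": "page_view", "url": "/home", "ts": 1_700_000_000 + i}
          for i in range(1_000)]

raw = json.dumps(events).encode()
compressed = gzip.compress(raw)

print(f"raw:   {len(raw):>8,} bytes")
print(f"gzip:  {len(compressed):>8,} bytes")
print(f"ratio: {len(raw) / len(compressed):.1f}x")
```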

Database Optimization

Database writes are often the bottleneck:

Write Optimization:

  • Batch writes (100-1000 records)
  • Use bulk insert APIs
  • Disable/defer index updates during bulk insert
  • Partition for parallel writes

Index Strategy:

  • Minimal indexes on high-write tables
  • Covering indexes for frequent queries
  • Defer index creation to batch process
  • Partial indexes for subset queries

Connection Pooling:

  • Reuse database connections
  • Size pool for concurrency
  • Monitor connection usage
  • Avoid connection exhaustion

Cost Optimization at Scale

At 100M events/day, costs matter:

Infrastructure Right-Sizing:

  • Profile actual resource usage
  • Avoid over-provisioning
  • Use auto-scaling for variable load
  • Reserved capacity for baseline

Tiered Storage:

  • Hot data (recent) on fast storage
  • Warm data on standard storage
  • Cold data (archive) on cheap storage
  • Automatic lifecycle management

Compression:

  • Compress stored events
  • 5-10x size reduction typical
  • CPU cost vs storage savings
  • Especially valuable for long retention

Sampling and Aggregation:

  • Store raw events for limited period
  • Store aggregates for longer
  • Reduces storage costs 10-100x
  • Accept loss of granularity for old data
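
A back-of-envelope sketch of the storage saved by keeping raw events briefly and only aggregates long-term; every size, retention period, and price below is an assumption to replace with your own figures.

```python
# Assumed figures; replace with your own event sizes, retention, and pricing.
DAILY_RAW_GB = 500            # raw event volume per day
RAW_RETENTION_DAYS = 30       # keep raw events for one month
AGG_REDUCTION = 50            # aggregates assumed ~50x smaller than raw
AGG_RETENTION_DAYS = 365      # keep aggregates for one year
PRICE_PER_GB_MONTH = 0.02     # assumed blended storage price (USD)

tiered_gb = DAILY_RAW_GB * RAW_RETENTION_DAYS \
    + (DAILY_RAW_GB / AGG_REDUCTION) * AGG_RETENTION_DAYS
all_raw_gb = DAILY_RAW_GB * AGG_RETENTION_DAYS   # keeping raw for a full year

print(f"tiered plan:    {tiered_gb:,.0f} GB, ~${tiered_gb * PRICE_PER_GB_MONTH:,.0f}/month")
print(f"raw for a year: {all_raw_gb:,.0f} GB, ~${all_raw_gb * PRICE_PER_GB_MONTH:,.0f}/month")
```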

Conclusion

Building systems that process 100M+ events per day requires architectural patterns specifically designed for scale:

  • Multi-stage ingestion with buffering to handle spikes and decouple stages
  • Write-optimized storage with appropriate partitioning for throughput
  • Idempotent processing to handle inevitable failures and duplicates
  • Pre-aggregation for real-time analytics, since aggregates cannot be computed on demand
  • Comprehensive monitoring to understand behavior at scale
  • Cost optimization as infrastructure costs become significant

The key insight: at this scale, you cannot reason about individual events. Architecture must handle events in aggregate, optimize for throughput over latency for individual events, and design for continuous partial failures.

Start with these patterns from the beginning. Retrofitting them into a system designed for lower scale is far more difficult than building them in from the start. The architecture that works at 1M events/day will not work at 100M. Plan for scale even if you’re not there yet.