As 2022 comes to a close, I’m reflecting on a year spent operating systems at a scale I hadn’t experienced before: 60+ microservices, 28,000+ customers, 400,000 monthly active users, and over 100 million events processed daily. This year taught me that architectural principles don’t change with scale—they become more important. Here are the most valuable lessons.

Lesson 1: Observability is Non-Negotiable Architecture

At smaller scales, you can debug production issues by reading logs and restarting services. At scale, that’s impossible.

The Cost of Poor Observability

Early in the year, we experienced a latency spike affecting 5% of requests. Without proper distributed tracing, we spent three days correlating logs across services. The issue? A dependency timeout in a service 7 hops deep in the call chain. With distributed tracing, we would have identified it in minutes.

The architectural lesson: observability must be designed in from the start, not added later.

What Works:

  • Structured logging with correlation IDs from day one
  • Distributed tracing for all service-to-service calls
  • Metrics with high cardinality support (not just aggregates)
  • Service dependency graphs updated automatically
  • Real-time dashboards for key user journeys
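To make the first two bullets concrete, here is a minimal sketch of structured logging with a correlation ID (the names are hypothetical; a real service would use a logging library rather than raw `json.dumps`):

```python
import json
import uuid

# Hypothetical sketch: emit one JSON object per log line, carrying a
# correlation ID that follows a single request across every service it
# touches. The ID is minted once at the edge and propagated downstream.
def make_log_line(correlation_id, service, message, **fields):
    return json.dumps({
        "correlation_id": correlation_id,
        "service": service,
        "message": message,
        **fields,
    })

cid = str(uuid.uuid4())   # minted at the edge of the system
line = make_log_line(cid, "checkout", "payment authorized", latency_ms=42)
```

Because every line is machine-parseable and shares the correlation ID, one log search reconstructs the entire request path across services.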

What Doesn’t:

  • Adding observability “when we have time”
  • Text-based logs without structure
  • Metrics without dimensions
  • Separate tools that don’t integrate
  • Manual service dependency documentation

The investment in observability pays for itself the first time you debug a production issue in minutes instead of days.

Lesson 2: Latency is an Architectural Property, Not a Performance Tuning Exercise

We started the year with p95 latency around 400ms. After months of optimization, we brought it down to 50ms. The key insight: the biggest wins came from architectural changes, not code optimization.

Architectural Patterns That Reduced Latency

Caching at Multiple Levels:

  • L1: In-process cache (microseconds)
  • L2: Distributed cache (milliseconds)
  • L3: Database (tens of milliseconds)

This multi-tier approach reduced database load by 90% and improved p95 latency by 150ms.
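The read path through those three tiers can be sketched as follows (a simplified illustration with hypothetical names; the "distributed cache" and "database" here are in-memory stand-ins):

```python
# Hypothetical sketch of the L1/L2/L3 read path: check the in-process
# cache first, then the distributed cache, and hit the database only on
# a double miss, back-filling both cache tiers on the way out.
class TieredCache:
    def __init__(self, l2_get, l2_set, db_get):
        self.l1 = {}            # in-process cache (microseconds)
        self.l2_get = l2_get    # distributed cache client (milliseconds)
        self.l2_set = l2_set
        self.db_get = db_get    # database fallback (tens of milliseconds)

    def get(self, key):
        if key in self.l1:
            return self.l1[key]
        value = self.l2_get(key)
        if value is None:
            value = self.db_get(key)
            self.l2_set(key, value)    # back-fill the distributed cache
        self.l1[key] = value           # back-fill the in-process cache
        return value

# In-memory stand-ins for the distributed cache and the database.
l2_store = {}
db = {"user:1": {"name": "Ada"}}
db_calls = []

cache = TieredCache(
    l2_store.get,
    l2_store.__setitem__,
    lambda k: db_calls.append(k) or db.get(k),
)

cache.get("user:1")   # misses L1 and L2, hits the database
cache.get("user:1")   # served from L1, no database call
```

The back-filling on a miss is what produces the 90% database load reduction: after the first read, repeat reads never leave the process.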

Asynchronous Processing:

  • Synchronous path: only what’s needed for response
  • Asynchronous path: analytics, notifications, non-critical updates
  • User doesn’t wait for background tasks

Moved 70% of processing off the critical path, reducing p95 by 100ms.
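The split between the synchronous and asynchronous paths can be sketched like this (hypothetical names; a real system would use a message broker rather than an in-process queue):

```python
from queue import Queue

# Hypothetical sketch: the request handler does only what the response
# needs, then enqueues analytics and notification work for a background
# worker, so the user never waits on it.
background = Queue()

def handle_request(user_id, payload):
    result = {"status": "ok", "user_id": user_id}     # synchronous path
    background.put(("analytics", user_id, payload))   # deferred
    background.put(("notify", user_id, payload))      # deferred
    return result

def drain_background():
    processed = []
    while not background.empty():
        processed.append(background.get())
    return processed

response = handle_request("u-1", {"action": "login"})
deferred = drain_background()   # a worker would do this continuously
```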

Database Denormalization:

  • Pre-joined data for common queries
  • Materialized views for complex aggregations
  • Accept eventual consistency for non-critical reads

Reduced complex joins, improving p95 by 50ms.

CDN for Static Content:

  • Edge caching for geographic distribution
  • Reduced backend calls for cacheable content
  • Improved response time for distant users

The pattern: optimize the architecture first (caching, async, denormalization), then optimize code. You cannot code your way out of architectural latency problems.

Lesson 3: Event-Driven Architecture Scales, But Requires Discipline

Event-driven architecture enabled us to scale from 10M to 100M+ events per day without major rewrites. But it came with challenges.

What We Got Right

Clear Event Schemas:

  • Schema registry from the start
  • Versioned event definitions
  • Backward compatibility requirements
  • Automated validation

Idempotent Event Handlers:

  • Every handler designed for duplicate events
  • Deduplication built into architecture
  • No assumptions about exactly-once delivery
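One common way to get that idempotency, sketched with hypothetical names: key every event with a stable ID and record which IDs have already been applied, so a redelivered event is detected and skipped rather than applied twice.

```python
# Hypothetical sketch: handlers track applied event IDs, so a broker
# redelivering the same event is harmless. A real system would persist
# the ID set (or derive idempotency from the write itself).
processed_ids = set()
balance = {"acct-1": 0}

def handle_credit(event):
    if event["event_id"] in processed_ids:
        return "duplicate"                    # already applied: drop safely
    balance[event["account"]] += event["amount"]
    processed_ids.add(event["event_id"])
    return "applied"

event = {"event_id": "evt-42", "account": "acct-1", "amount": 100}
first = handle_credit(event)
second = handle_credit(event)   # broker redelivered the same event
```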

Dead Letter Queues:

  • Separate queue for failed events
  • Automatic retry with exponential backoff
  • Manual review process for persistent failures
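Those three bullets fit together as a small retry loop; here is a hedged sketch (hypothetical names, with the sleep omitted so it runs instantly):

```python
# Hypothetical sketch: retry a failing handler with exponential backoff,
# then park the event on a dead letter queue for manual review once the
# retries are exhausted.
dead_letter_queue = []

def process_with_retries(handler, event, max_attempts=3, base_delay=0.1):
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(event)
        except Exception:
            if attempt == max_attempts:
                dead_letter_queue.append(event)   # park for manual review
                return None
            # time.sleep(delay) in a real consumer; omitted here
            delay *= 2                            # exponential backoff

attempts = []
def flaky_handler(event):
    attempts.append(event["id"])
    raise RuntimeError("downstream unavailable")

process_with_retries(flaky_handler, {"id": "evt-7"})
```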

What Bit Us

Event Chain Complexity:

  • One event triggered another, which triggered another
  • Debugging cascading failures was a nightmare
  • Hard to reason about end-to-end flows

We learned to limit event chains to 3-4 hops maximum. Beyond that, switch to explicit orchestration rather than letting the choreography sprawl.

Schema Evolution Mistakes:

  • Removing fields without deprecation period
  • Changing field types without version bump
  • Insufficient testing of mixed versions

Now we have automated tests ensuring producers and consumers of different versions can interoperate.

Unbounded Event Size:

  • Some events grew to megabytes
  • Overwhelmed message brokers
  • Had to implement size limits and move large payloads to object storage

The lesson: event-driven architecture is powerful, but requires architectural guardrails to prevent chaos.

Lesson 4: Microservices Organization Follows Conway’s Law

We reorganized teams midyear to align with service boundaries. The impact was dramatic.

Before: Function-Based Teams

  • Frontend team
  • Backend team
  • Data team
  • Infrastructure team

Results: Every feature required coordination across all teams. Velocity was slow.

After: Domain-Based Teams

  • User domain team (owns auth, profile, preferences services)
  • Activity domain team (owns events, analytics, insights services)
  • Notification domain team (owns email, push, preferences services)
  • Platform team (owns infrastructure, observability, common libraries)

Results: Teams could deploy independently. Velocity increased significantly.

Key Architectural Insights

Service boundaries should match team boundaries:

  • If a service is split across teams, coordination overhead kills velocity
  • If a team owns too many services, cognitive load is overwhelming
  • Sweet spot: 3-7 services per team of 6-8 engineers

Platform team enables domain teams:

  • Platform builds self-service infrastructure
  • Domain teams focus on business logic
  • Clear API boundaries between platform and domain services
  • Platform team should be 10-20% of total engineering

Communication patterns predict coupling:

  • Teams that talk frequently can own tightly coupled services
  • Teams that rarely talk need loose coupling (events, not RPC)
  • Organizational structure is architectural blueprint

Lesson 5: Operational Excellence is an Architectural Decision

We spent significant time this year reducing operational burden. The biggest impact came from architectural choices, not better runbooks.

Architectural Patterns for Operational Excellence

Automated Remediation:

  • Circuit breakers close automatically
  • Auto-scaling handles load spikes
  • Health checks trigger instance replacement
  • Fewer manual interventions

Immutable Infrastructure:

  • No SSH into servers to debug
  • No manual configuration changes
  • All changes through deployment pipeline
  • Easier to reason about system state

Declarative Configuration:

  • Desired state declared in code
  • System converges to desired state
  • Rollback is changing desired state
  • Git history is audit trail
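The "converges to desired state" bullet is the heart of this pattern; a toy reconciler (hypothetical names, in the spirit of tools like Kubernetes controllers) makes it concrete:

```python
# Hypothetical sketch of declarative convergence: a reconciler compares
# desired replica counts against actual state and issues only the changes
# needed to close the gap. Rollback is just declaring the previous state.
def reconcile(desired, actual):
    actions = []
    for service, replicas in desired.items():
        current = actual.get(service, 0)
        if current != replicas:
            actions.append((service, current, replicas))
            actual[service] = replicas    # converge toward desired state
    return actions

actual = {"api": 3, "worker": 5}
desired = {"api": 3, "worker": 2, "cache": 1}
actions = reconcile(desired, actual)
```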

Gradual Rollouts:

  • Canary deployments (1%, 10%, 50%, 100%)
  • Automatic rollback on error rate increase
  • Feature flags for runtime control
  • Reduced blast radius of bugs
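A common way to implement those canary percentages, sketched with hypothetical names: hash each user into a stable bucket, so raising the rollout percentage only ever adds users and nobody flips back and forth between canary and stable.

```python
import hashlib

# Hypothetical sketch: deterministic percentage rollout. Hashing the
# user ID (salted with the flag name) buckets each user into 0-99; the
# flag is on when the bucket falls below the rollout percentage.
def in_rollout(user_id, flag, percentage):
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage

users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if in_rollout(u, "new-checkout", 10)}
at_50 = {u for u in users if in_rollout(u, "new-checkout", 50)}
```

Widening the canary from 10% to 50% strictly grows the exposed set, which keeps error-rate comparisons between stages meaningful.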

The insight: if your architecture requires constant manual intervention, the architecture is wrong. Automate the operations or redesign the system.

Lesson 6: Data Architecture Determines Query Performance

We learned this the hard way when analytics queries started timing out.

The Problem

Initially, we stored all events in a single append-only table. As volume grew to billions of rows, queries slowed to minutes or hours.

The Solution: Architectural Redesign

Time-Based Partitioning:

  • One partition per day
  • Queries specify date range
  • Partition pruning eliminates 99% of data
  • Queries went from minutes to seconds
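Partition pruning is easiest to see in miniature; here is a hedged sketch with one in-memory "partition" per day (real systems do this in the database or query engine):

```python
from datetime import date, timedelta

# Hypothetical sketch of partition pruning: events live in one partition
# per day, and a date-ranged query touches only the partitions inside the
# range instead of scanning the whole table.
partitions = {}   # day -> list of events

def insert(event):
    partitions.setdefault(event["day"], []).append(event)

def query(start, end):
    scanned = []
    day = start
    while day <= end:                       # prune: only days in range
        scanned.extend(partitions.get(day, []))
        day += timedelta(days=1)
    return scanned

for offset in range(365):                   # a year of daily partitions
    insert({"day": date(2022, 1, 1) + timedelta(days=offset), "n": offset})

recent = query(date(2022, 12, 25), date(2022, 12, 27))
```

A three-day query reads 3 of 365 partitions; the other 99% of the data is never touched, which is where the minutes-to-seconds win comes from.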

Columnar Storage:

  • Switched from row-oriented to column-oriented
  • Compressed columnar format (Parquet)
  • 10x reduction in storage
  • 100x improvement in analytical query speed

Pre-Aggregated Rollups:

  • Hourly, daily, and monthly aggregates
  • Computed incrementally
  • Serve from rollups for date ranges
  • Raw data only for detailed drill-down
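The "computed incrementally" bullet can be sketched as follows (hypothetical names; a real pipeline would persist the buckets):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sketch: each raw event increments its hourly bucket as it
# arrives, so the rollup is maintained incrementally and dashboards read
# a handful of bucket rows instead of scanning raw events.
hourly_counts = defaultdict(int)

def ingest(event):
    bucket = event["ts"].replace(minute=0, second=0, microsecond=0)
    hourly_counts[bucket] += 1

start = datetime(2022, 6, 1, 9, 0)
for minute in range(120):                  # two hours of raw events
    ingest({"ts": start + timedelta(minutes=minute)})
```

Daily and monthly rollups follow the same shape, each level aggregating the one below it.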

Tiered Storage:

  • Recent data (7 days) on fast SSD
  • Medium-term (30 days) on standard storage
  • Long-term (1+ year) on cheap archive storage
  • 70% cost reduction

The pattern: data architecture (partitioning, storage format, tiering) has more impact on query performance than indexes or query optimization.

Lesson 7: Resilience Requires Designing for Failure

At scale, failures are continuous. Our architecture evolved to embrace failure rather than trying to prevent it.

Failure Patterns We Handle

Service Failures:

  • Any service can fail at any time
  • Circuit breakers prevent cascade
  • Fallbacks provide degraded service
  • System remains partially functional
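The interplay of circuit breakers and fallbacks can be sketched like this (a simplified count-based breaker with hypothetical names; production breakers also track a recovery timeout and a half-open state):

```python
# Hypothetical sketch of a circuit breaker with a fallback: after enough
# consecutive failures the breaker opens and calls go straight to the
# fallback, so a struggling dependency isn't hammered while it recovers.
class CircuitBreaker:
    def __init__(self, call, fallback, failure_threshold=3):
        self.call = call
        self.fallback = fallback
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.failure_threshold

    def request(self, *args):
        if self.open:
            return self.fallback(*args)    # degraded but functional
        try:
            result = self.call(*args)
            self.failures = 0              # success resets the breaker
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args)

calls = []
def failing_service(x):
    calls.append(x)
    raise RuntimeError("dependency down")

breaker = CircuitBreaker(failing_service, lambda x: "cached-default")
results = [breaker.request(i) for i in range(5)]
```

After the third failure the breaker opens, so the last two requests never reach the failing dependency, yet every caller still gets a response.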

Network Failures:

  • Transient errors are common
  • Automatic retries with exponential backoff
  • Timeouts prevent indefinite waiting
  • Async communication where possible

Data Center Failures:

  • Multi-zone deployment
  • Automatic failover
  • No single point of failure
  • Accept eventual consistency

Architectural Patterns for Resilience

Bulkheading:

  • Isolate failures to components
  • Separate thread pools per dependency
  • Prevent one failure from exhausting resources
  • Contain blast radius

Graceful Degradation:

  • Identify critical vs non-critical functionality
  • Non-critical can fail without affecting critical
  • Serve partial responses rather than errors
  • User experience degrades, doesn’t break

Chaos Engineering:

  • Regularly inject failures
  • Verify resilience mechanisms work
  • Find weaknesses before they cause outages
  • Build confidence in system resilience

The key lesson: resilient architecture assumes failure and designs around it, rather than trying to prevent all failures.

Lesson 8: Cost Optimization is Continuous Architectural Work

At 100M+ events per day, infrastructure costs became significant. Optimization required architectural thinking.

Biggest Cost Wins

Right-Sizing Compute:

  • Most services over-provisioned
  • Profiling revealed actual CPU/memory usage
  • Right-sized instances saved 40% on compute
  • Auto-scaling handles variability

Storage Tiering:

  • Lifecycle policies move old data to cheaper storage
  • 70% of data rarely accessed
  • Tiering saved 60% on storage costs
  • No impact on performance for common queries

Compression:

  • Events compressed before storage
  • 8x size reduction on average
  • CPU cost minimal (modern compression is fast)
  • Reduced storage and network costs
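A quick sketch of the idea with the standard library (the exact ratio depends on the payload, so the 8x average above is workload-specific):

```python
import zlib

# Hypothetical sketch: compress event payloads before storage or the
# wire. Repetitive event data (same keys, similar values) compresses
# well, which is why batches shrink so dramatically.
event = b'{"user_id": "user-123", "event": "page_view", "path": "/home"}'
batch = event * 1000                   # a batch of similar events
compressed = zlib.compress(batch)
ratio = len(batch) / len(compressed)
restored = zlib.decompress(compressed)
```

Compressing batches rather than individual events matters: the redundancy across similar events is where most of the savings live.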

Sampling:

  • Not all events need full processing
  • Sample non-critical events
  • 10x cost reduction for analytics pipelines
  • Statistically valid for aggregate metrics
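One way to sample deterministically, sketched with hypothetical names: hash the event ID so every service makes the same keep-or-drop decision, and scale kept counts back up by the inverse of the rate for aggregate metrics.

```python
import hashlib

# Hypothetical sketch: deterministic 10% sampling for non-critical
# analytics events. Hashing the event ID keeps the decision stable
# across services; scaling by 1/rate recovers aggregate estimates.
SAMPLE_RATE = 0.10

def sampled(event_id):
    digest = hashlib.sha256(event_id.encode()).hexdigest()
    return int(digest, 16) % 100 < SAMPLE_RATE * 100

events = [f"evt-{i}" for i in range(10_000)]
kept = [e for e in events if sampled(e)]
estimated_total = len(kept) / SAMPLE_RATE   # scale back up for aggregates
```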

The insight: cost optimization isn’t about buying cheaper servers. It’s about architectural efficiency—doing less work, storing less data, moving data less frequently.

Lesson 9: Platform Engineering Accelerates Feature Teams

We invested heavily in platform capabilities this year. The ROI was enormous.

Platform Investments

Self-Service Infrastructure:

  • Service templates with best practices baked in
  • One command to create new service
  • Observability, security, deployment included
  • New services production-ready in hours, not weeks

Developer Tooling:

  • Local development environment
  • Integration test framework
  • Load testing infrastructure
  • Reduced friction for engineers

Golden Paths:

  • Documented patterns for common scenarios
  • Reference implementations
  • Automated validation of compliance
  • Easy to do the right thing

The architectural principle: platform capabilities are force multipliers. One platform engineer enabling 10 feature engineers is high leverage.

Lesson 10: Documentation is Architecture

Finally, the most surprising lesson: documentation quality directly correlates with architectural quality.

Systems with Poor Documentation:

  • Complex, hard to understand
  • Implicit assumptions everywhere
  • Fear of changing anything
  • Tribal knowledge required

Systems with Good Documentation:

  • Clear abstractions and boundaries
  • Explicit about trade-offs
  • Easier to reason about
  • Easier to change

Documentation doesn’t make bad architecture good. But documenting forces you to articulate design decisions, which exposes bad architecture. If you can’t explain it clearly, it’s probably too complex.

Looking Forward to 2023

These lessons shape how I’ll approach architecture in 2023:

  1. Invest in observability from day one - not negotiable
  2. Design for latency at the architecture level - caching, async, denormalization
  3. Embrace event-driven, but with guardrails - schema management, bounded chains
  4. Align teams and services - Conway’s Law is a feature, not a bug
  5. Automate operations - if it requires manual intervention, fix the architecture
  6. Design data architecture for queries - partitioning, storage format, tiering
  7. Assume failure - resilience through architecture, not perfection
  8. Optimize costs continuously - efficiency is architectural
  9. Build platforms - enable feature teams with self-service infrastructure
  10. Document ruthlessly - forces clear thinking about design

The common thread: architecture isn’t just about technology choices. It’s about enabling teams to move quickly while maintaining reliability at scale. The best architecture is invisible—it just works, and teams can focus on delivering value.

Here’s to scaling further in 2023 while keeping the architecture simple, the systems reliable, and the teams productive.