NoSQL Storage Patterns and Architecture

NoSQL databases are gaining significant traction, driven by web-scale applications and big data requirements. But NoSQL is fundamentally different from traditional relational databases, especially in storage architecture. Let me explore these differences and what they mean for storage infrastructure.

Why NoSQL?

Traditional relational databases (RDBMS) have limitations at web scale:

Scaling: Relational databases typically scale vertically (bigger hardware) rather than horizontally (more servers), and vertical scaling eventually hits hardware and cost limits.

Schema Rigidity: Fixed schemas make it hard to evolve data models rapidly.

Join Complexity: Joins across massive distributed datasets are expensive.

Transactions: ACID transactions are expensive to implement across distributed systems.

NoSQL databases relax some traditional database guarantees to achieve better scalability and flexibility.

The CAP Theorem

Understanding NoSQL requires understanding the CAP theorem:

Consistency: All nodes see the same data at the same time.

Availability: Every request to a working node receives a response, though not necessarily one that reflects the latest write.

Partition Tolerance: System continues operating despite network partitions.

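The CAP theorem states that a distributed system can guarantee at most two of these three properties at once. Since network partitions cannot be ruled out in practice, the real choice is between consistency and availability while a partition lasts.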

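Traditional databases are usually described as choosing Consistency and Availability, which means they struggle when partitions occur. Many NoSQL databases, particularly Dynamo-style stores, choose Availability and Partition tolerance and relax Consistency, though some (HBase, for example) favor Consistency instead.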

NoSQL Categories

NoSQL databases fall into several categories:

Key-Value Stores

Simple model: key → value lookups.

Examples: Redis, Riak, Dynamo

Storage Pattern: Hash table, often distributed across nodes via consistent hashing.

Use Cases: Caching, session storage, shopping carts.

Characteristics: Very fast, simple, limited query capability.

Document Stores

Store documents (JSON, XML) with flexible schemas.

Examples: MongoDB, CouchDB

Storage Pattern: B-trees or LSM trees, indexed by document ID and fields.

Use Cases: Content management, catalogs, user profiles.

Characteristics: Flexible schema, rich queries, horizontal scaling.

Column-Family Stores

Store data in column families rather than rows.

Examples: Cassandra, HBase

Storage Pattern: Sorted maps, often LSM-tree based.

Use Cases: Time-series data, analytics, high write throughput.

Characteristics: Excellent write performance, good for sparse data.

Graph Databases

Optimize for graph traversal and relationships.

Examples: Neo4j, InfiniteGraph

Storage Pattern: Adjacent nodes stored nearby for efficient traversal.

Use Cases: Social networks, recommendation engines, fraud detection.

Characteristics: Excellent for relationship-heavy data, less suited to bulk analytical scans across the whole dataset.

Each category has different storage architecture optimized for its access patterns.

Storage Architecture Patterns

Log-Structured Merge Trees (LSM)

Many NoSQL databases use LSM trees:

Write Path:

Writes go to memory (memtable)
When memtable fills, flush to disk as sorted file (SSTable)
Background processes merge SSTables to compact data

Read Path:

Check memtable first
Check SSTables from newest to oldest
May need to check multiple files (read amplification)

Benefits: Excellent write performance, because writes to disk are sequential.

Costs: Reads can be slower, compaction creates I/O overhead.

Cassandra, HBase, and LevelDB use LSM trees.
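
The write and read paths above can be sketched in a few lines of Python. This is a toy model, not how any of these databases is actually implemented: the class name TinyLSM and the memtable_limit threshold are made up for illustration, the "SSTables" are just sorted in-memory lists, and there is no commit log, no deletion, and no Bloom filter.

```python
from bisect import bisect_left


class TinyLSM:
    """Toy LSM tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # newest writes, unsorted
        self.sstables = []                    # sorted runs, newest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted, immutable run."""
        if self.memtable:
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Return (value, runs_checked); value is None if the key is absent."""
        if key in self.memtable:
            return self.memtable[key], 0
        checked = 0
        for run in self.sstables:             # newest to oldest
            checked += 1
            i = bisect_left(run, (key,))      # binary search within the sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1], checked
        return None, checked

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in reversed(self.sstables):   # oldest first, so newer values overwrite
            merged.update(run)
        self.sstables = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 10), ("d", 4)]:
    db.put(k, v)

print(db.get("a"))   # (10, 1): newest run already has the updated value
print(db.get("b"))   # (2, 2): had to check two runs (read amplification)
db.compact()
print(db.get("b"))   # (2, 1): one run left after compaction
```

The second value returned by get is the number of sorted runs that had to be searched, which makes the trade-off visible: cheap writes, but reads that get slower until compaction merges the runs.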

B-Trees

Some NoSQL databases use B-trees, like traditional databases:

Benefits: Good read performance, predictable.

Costs: Updates cause random writes, and changing a small record can mean rewriting whole pages (write amplification).

MongoDB uses B-tree indexes.

Consistent Hashing

Distributed NoSQL databases use consistent hashing to partition data:

Approach: Hash keys to positions on a ring. Assign nodes to points on the same ring. Each node owns the keys that fall between the previous node's position and its own.

Benefits: Adding or removing a node only moves the keys in the ranges adjacent to it, rather than reshuffling the entire cluster.

Virtual Nodes: Each physical node owns multiple virtual nodes for better distribution.

Dynamo, Cassandra, and Riak use consistent hashing.
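
Here is a minimal sketch of a hash ring with virtual nodes. The class name HashRing, the node names, the choice of SHA-256 as the ring hash, and the figure of 64 virtual nodes per physical node are all assumptions for illustration. Real systems use their own partitioners and token assignment schemes.

```python
import hashlib
from bisect import bisect_left


def ring_position(label):
    """Map a string to a point on the ring (here, a 256-bit integer from SHA-256)."""
    return int(hashlib.sha256(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual nodes per physical node (illustrative value)
        self.points = []       # sorted ring positions of all virtual nodes
        self.owner = {}        # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            self.owner[ring_position(f"{node}#{i}")] = node
        self.points = sorted(self.owner)

    def remove_node(self, node):
        self.owner = {p: n for p, n in self.owner.items() if n != node}
        self.points = sorted(self.owner)

    def lookup(self, key):
        """Return the node owning the first virtual node at or after the key, wrapping around."""
        if not self.points:
            raise LookupError("ring is empty")
        i = bisect_left(self.points, ring_position(key))
        return self.owner[self.points[i % len(self.points)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.lookup(k) for k in keys}

ring.add_node("node-d")
moved = [k for k in keys if ring.lookup(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to the new node")
```

With a naive scheme such as hash(key) % number_of_nodes, going from three nodes to four would remap most keys. On the ring, the only keys that move are the ones the new node takes over.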

Replication Strategies

NoSQL databases replicate data for reliability and availability:

Master-Slave Replication

One master accepts writes, replicates to slaves.

Benefits: Simple, read scaling by adding slaves.

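Costs: Master is a single point of failure for writes until a replacement is promoted.

MongoDB follows this model: in a replica set, one primary accepts writes and replicates them to secondaries, and the set elects a new primary if the current one fails.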

Multi-Master Replication

Multiple nodes accept writes.

Benefits: No single point of failure, geographical distribution.

Costs: Conflict resolution needed for concurrent writes.

Cassandra and Riak use multi-master replication.
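
To make the conflict-resolution cost concrete, here is a sketch of one common strategy, timestamp-based last-write-wins. The Versioned type and its fields are hypothetical. This is only one option: it silently discards the losing write, and other approaches (vector clocks, application-level merging) keep both versions so the conflict can be resolved explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int   # writer-supplied clock, e.g. microseconds since epoch
    node_id: str     # tie-breaker so every replica picks the same winner


def merge(a, b):
    """Last-write-wins: keep the version with the larger (timestamp, node_id)."""
    return a if (a.timestamp, a.node_id) >= (b.timestamp, b.node_id) else b


# Two nodes accepted different writes for the same key at nearly the same time.
from_node_1 = Versioned("red", 100, "n1")
from_node_2 = Versioned("blue", 105, "n2")

# Replicas can exchange versions in any order and still converge on one value.
print(merge(from_node_1, from_node_2).value)   # blue
print(merge(from_node_2, from_node_1).value)   # blue
```

Because merge gives the same answer regardless of argument order, replicas converge no matter which one hears about which write first. The price is that "red" is lost without anyone being told, and the result depends on how well the writers' clocks agree.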

Quorum-Based Replication

Configurable consistency: choosing R + W > N guarantees that every read quorum overlaps every write quorum.

N: Total replicas
R: Nodes that must respond to read
W: Nodes that must acknowledge write

Example: N=3, R=2, W=2. Reads see latest write because at least one node in the read quorum participated in the write quorum.

Cassandra and Riak support tunable quorum-based consistency.
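
The overlap argument can be checked by brute force. The sketch below enumerates every possible read set and write set for small clusters and confirms that they always share a replica exactly when R + W > N. The function name is made up for illustration, and the model ignores failures, hinted handoff, and the other practical complications real systems face.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every read set of size r shares a replica with every write set of size w."""
    replicas = range(n)
    return all(
        set(read) & set(write)
        for read in combinations(replicas, r)
        for write in combinations(replicas, w)
    )


print(quorums_always_overlap(3, 2, 2))   # True:  2 + 2 > 3
print(quorums_always_overlap(3, 1, 1))   # False: 1 + 1 <= 3, a read can miss the write
print(quorums_always_overlap(3, 1, 3))   # True:  write to all, read from any one

# The brute-force result matches the R + W > N rule for every small configuration.
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert quorums_always_overlap(n, r, w) == (r + w > n)
```

The same rule shows the tuning trade-off: with N=3, R=1 and W=3 gives fast reads but writes that fail if any replica is down, while R=3 and W=1 gives the opposite.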

Storage I/O Patterns

NoSQL databases have different I/O patterns than RDBMS:

Write-Heavy Workloads: Many NoSQL databases optimize for write throughput. LSM trees make writes sequential.

Large Values: Document stores may store large documents (megabytes).

Time-Series: Column stores often handle time-series data with append-heavy patterns.

Read Patterns: Some NoSQL databases scan large datasets (HBase for analytics) while others do point lookups (Redis for caching).

Understanding your specific NoSQL database’s I/O pattern is essential for storage design.

Storage Infrastructure Implications

Disk Configuration

NoSQL databases often prefer:

Many Spindles: More disks mean more I/O parallelism.

SSDs for Certain Workloads: Read-heavy workloads with random access benefit most from SSDs. Write-heavy LSM-based databases may benefit less, since their writes are already sequential.

Separate Commit Logs: Some databases write commit logs to separate disks for better performance.

JBOD vs. RAID: Depends on database. Some (like Cassandra) provide their own replication and prefer JBOD. Others (like MongoDB) benefit from RAID.

Memory

NoSQL databases are often memory-hungry:

Caching: Large caches improve read performance significantly.

Memtables: LSM-tree databases use memory for write buffering.

Bloom Filters: Used to avoid disk reads for non-existent keys.

Memory is often more important than disk for NoSQL performance.
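
The Bloom filter idea mentioned above is simple enough to sketch. A filter is kept in memory for each on-disk file. Before reading the file, the database asks the filter whether the key might be there. The class below is an illustration only: the sizes, the number of hash functions, and the salted SHA-256 hashing are my assumptions, not what any particular database uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means definitely absent; True means present or a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


sstable_filter = BloomFilter()
for key in ["user:1", "user:2", "user:3"]:
    sstable_filter.add(key)

if sstable_filter.might_contain("user:2"):
    print("possibly in this file, go read it")
if not sstable_filter.might_contain("user:999"):
    print("definitely not in this file, skip the disk read")
```

A filter never gives a false negative, so skipping a file on a "no" is always safe. It can give false positives, which only cost an unnecessary disk read, and more bits per key make those rarer.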

Network

Distributed NoSQL databases generate significant network traffic:

Replication: Data is replicated across nodes.

Queries: Queries may hit multiple nodes.

Anti-Entropy: Background processes synchronize replicas.

10 Gigabit Ethernet is becoming standard for NoSQL clusters.

Consistency Models

NoSQL databases offer various consistency models:

Eventual Consistency: Replicas eventually converge, but may temporarily diverge. Provides high availability.

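Read-Your-Writes: After a write, the same client always sees that write on later reads. Other clients may not see it immediately.

Session Consistency: Read-your-writes is guaranteed within a single client session, but not across sessions.

Strong Consistency: Every read returns the most recent write, so all clients see the same data. This comes at the cost of reduced availability during failures or partitions.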

Applications must be designed for the consistency model the database provides.

Data Modeling Differences

NoSQL data modeling differs from relational:

Denormalization: Duplicate data to avoid joins. Storage is cheaper than joins.

Nesting: Embed related data within documents rather than separate tables.

Application-Side Joins: If you need joins, do them in application code, not in the database.

Schema Design for Queries: Design schema based on query patterns, not normalized form.

This is a significant shift for those used to relational data modeling.
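
A small example may help show the shift. The sketch below starts from relational-style data (plain Python dicts and lists standing in for tables), performs an application-side join, and then builds the denormalized, nested form a document store would typically hold. All names and fields are invented for illustration.

```python
# Normalized, relational-style data: two "tables" linked by user_id.
users = {1: {"name": "Ada"}, 2: {"name": "Linus"}}
orders = [
    {"order_id": 100, "user_id": 1, "total": 25.0},
    {"order_id": 101, "user_id": 1, "total": 40.0},
    {"order_id": 102, "user_id": 2, "total": 15.0},
]


def orders_with_names(users, orders):
    """Application-side join: look up each order's user in code."""
    return [{**o, "user_name": users[o["user_id"]]["name"]} for o in orders]


def to_user_documents(users, orders):
    """Denormalize: embed each user's orders inside the user document."""
    docs = {uid: {"_id": uid, "name": u["name"], "orders": []} for uid, u in users.items()}
    for o in orders:
        docs[o["user_id"]]["orders"].append({"order_id": o["order_id"], "total": o["total"]})
    return list(docs.values())


for doc in to_user_documents(users, orders):
    print(doc)
```

With the nested form, the query "show this user and their orders" is a single lookup by key with no join. The cost is that a query the schema was not designed for, such as "all orders over 30 across all users", now has to scan every user document.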

Backup and Recovery

NoSQL backup is different:

Snapshot-Based: Take snapshots of data directories.

Streaming: Stream data to backup location continuously.

Cross-Datacenter Replication: Use replication for disaster recovery rather than traditional backup.

Point-in-Time Recovery: May not be available. Not every NoSQL database keeps a transaction log that can be replayed to an arbitrary point in time.

Ensure your NoSQL database’s backup capabilities meet your requirements.

Monitoring

Monitor NoSQL databases differently than RDBMS:

Cluster Health: Monitor all nodes, not just one server.

Replication Lag: How far behind are replicas?

Compaction: Monitor compaction progress and I/O overhead.

Cache Hit Rates: Cache effectiveness significantly impacts performance.

Key Distribution: Ensure data is distributed evenly across nodes.

Each NoSQL database has specific metrics to monitor.
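
As one example of a cluster-level metric, key distribution can be reduced to a single number. The sketch below assumes you can already obtain a key count (or data size) per node from your database's own tools. The function name and the sample figures are hypothetical.

```python
def imbalance_ratio(keys_per_node):
    """Ratio of the busiest node's key count to the mean; 1.0 is perfectly even."""
    counts = list(keys_per_node.values())
    if not counts:
        raise ValueError("no nodes reported")
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 1.0


balanced = {"node-a": 100, "node-b": 100, "node-c": 100}
skewed = {"node-a": 300, "node-b": 0, "node-c": 0}

print(imbalance_ratio(balanced))   # 1.0
print(imbalance_ratio(skewed))     # 3.0: one node holds three times its fair share
```

A ratio that drifts well above 1.0 usually points to a hot partition or a poorly chosen key, and is worth alerting on before the busiest node runs out of disk or I/O capacity.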

When NoSQL Makes Sense

NoSQL is appropriate for:

Massive Scale: Horizontal scaling to hundreds or thousands of nodes.

Flexible Schema: Rapidly evolving data models.

High Availability: Applications that prioritize availability over consistency.

Specific Access Patterns: Key-value lookups, time-series, graphs—workloads that match NoSQL strengths.

NoSQL doesn’t make sense for:

Complex Transactions: Multi-table transactions with ACID guarantees.

Ad-Hoc Queries: Complex queries not known at design time.

Strong Consistency Requirements: Where eventual consistency is unacceptable.

Small Scale: The operational overhead of running a distributed cluster is not worthwhile for small datasets.

Polyglot Persistence

The future is likely polyglot persistence—using different databases for different needs:

RDBMS: Transactional data, complex queries.

Document Store: Flexible content, catalogs.

Key-Value: Caching, sessions.

Column Store: Analytics, time-series.

Graph: Relationships, social networks.

Use the right database for each workload rather than one database for everything.

Storage Virtualization and NoSQL

How does storage virtualization fit with NoSQL?

Less Relevant: NoSQL databases often use direct-attached storage and provide their own virtualization (via replication and distribution).

Integration Points: Data movement between NoSQL and traditional storage for analytics or backup.

Hybrid Approaches: Some organizations use SANs for smaller NoSQL deployments where operational simplicity outweighs scale-out benefits.

Generally, NoSQL architectural benefits are maximized with direct-attached storage.

Performance Considerations

NoSQL performance optimization:

Hardware: Adequate CPU, memory, disk, and network for your workload.

Data Modeling: Design schema for your access patterns.

Consistency Tuning: Lower consistency levels improve performance and availability, at the cost of reads that may return stale data.

Caching: Large caches dramatically improve read performance.

Compaction Tuning: Balance compaction overhead with read amplification.

Performance tuning is database-specific and requires deep understanding of your particular NoSQL database.

The Evolution

NoSQL is evolving rapidly:

Adding SQL: Many NoSQL databases add SQL-like query languages for easier use.

Stronger Consistency: Options for stronger consistency when needed.

Better Tools: Improved management, monitoring, and operational tools.

Hybrid Databases: Databases that support both document and relational models.

The line between NoSQL and traditional databases is blurring.

Conclusion

NoSQL databases represent different trade-offs than traditional relational databases. By relaxing certain guarantees (schema, consistency, transactions), they achieve better scalability and availability.

Understanding NoSQL storage architecture—LSM trees, consistent hashing, replication strategies—is increasingly important. As applications scale, NoSQL becomes more common.

Working on traditional storage at Cisco, I see NoSQL as complementary rather than competitive. Different tools for different jobs. The best architecture uses appropriate databases for each workload.

NoSQL isn’t a panacea, but for specific use cases—massive scale, flexible schema, high availability—it provides capabilities traditional databases struggle with.

As storage professionals, understanding both traditional and NoSQL storage architectures makes us more versatile and valuable. The future is heterogeneous, with multiple storage and database technologies coexisting in the same infrastructure.