Hadoop is transforming how organizations handle big data. But Hadoop’s storage model is fundamentally different from traditional enterprise storage. Understanding these differences is critical for anyone working in storage. Let me explore what makes Hadoop storage unique and the implications for storage architecture.

The Hadoop Storage Philosophy

Hadoop Distributed File System (HDFS) embodies a different philosophy than traditional storage:

Commodity Hardware: Use cheap servers with local disks instead of expensive SANs.

Software Reliability: Handle failures in software rather than preventing them with expensive hardware.

Scale-Out: Add more nodes rather than bigger nodes.

Data Locality: Move computation to data rather than moving data to computation.

Large Files: Optimize for large files (gigabytes to terabytes), not millions of small files.

This philosophy drives radically different storage architecture.

HDFS Architecture

HDFS consists of:

NameNode: Manages metadata—which files exist, which blocks compose each file, where blocks are stored. Single point of coordination.

DataNodes: Store actual data blocks. Typically one per server, using local disks.

Blocks: Files are divided into large blocks (64 MB by default in older releases, 128 MB in Hadoop 2 and later). Each block is replicated to multiple DataNodes (3 replicas by default).

Clients: Applications access HDFS through client libraries that talk to NameNode for metadata and DataNodes for data.

This architecture provides scalability that traditional file systems can’t match.
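
To make the division of labor concrete, here is a minimal sketch in Java against the Hadoop FileSystem API, with a made-up path. The metadata calls are answered by the NameNode; the block locations it returns tell the client which DataNodes actually hold the data.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.util.Arrays;

    public class HdfsMetadataExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // client handle to the cluster

            Path file = new Path("/data/events/2013/part-00000");  // hypothetical file

            // Metadata comes from the NameNode: size, block size, replication factor.
            FileStatus status = fs.getFileStatus(file);
            System.out.printf("len=%d blockSize=%d replicas=%d%n",
                    status.getLen(), status.getBlockSize(), status.getReplication());

            // Block locations also come from the NameNode; the data itself is then
            // streamed directly from the DataNodes that hold each replica.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d hosts=%s%n",
                        loc.getOffset(), Arrays.toString(loc.getHosts()));
            }
            fs.close();
        }
    }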

Replication for Reliability

HDFS achieves reliability through replication:

3-Way Replication: Each block is stored on three DataNodes. If one node fails, two copies remain.

Replica Placement: Replicas are placed strategically: the first on the writer's node (or a node in the local rack), the second on a node in a different rack, and the third on a different node in that second rack. This balances reliability (the data survives a rack failure) with performance (one replica is close to the writer and two replicas share a rack).

Automatic Re-replication: When a DataNode fails, NameNode detects it and instructs other DataNodes to create new replicas.

This approach accepts that hardware will fail and handles it automatically in software.
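
A small illustration of how visible this is to clients (the path is hypothetical): the replication factor is just per-file metadata, and changing it causes the NameNode to schedule extra copies or deletions in the background.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/archive/old-logs.seq");   // hypothetical file

            short current = fs.getFileStatus(file).getReplication();
            System.out.println("current replication: " + current);

            // Request 2 replicas instead of 3; the NameNode asynchronously
            // brings the stored block copies in line with the new target.
            boolean accepted = fs.setReplication(file, (short) 2);
            System.out.println("change accepted: " + accepted);
            fs.close();
        }
    }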

Compare to enterprise storage where we use RAID, redundant controllers, and high-quality components to prevent failures. Different philosophies, both valid for their use cases.

Storage Hardware for Hadoop

Hadoop clusters use specific hardware configurations:

Disk Configuration:

  • 12-24 disks per server
  • SATA drives (7.2K RPM) for cost-effectiveness
  • JBOD (Just a Bunch of Disks)—no RAID
  • Each disk presents as separate device to HDFS

Why No RAID?: HDFS replication provides redundancy. RAID adds cost and complexity without benefit. If a disk fails, HDFS re-replicates affected blocks from other nodes.
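
For reference, the sketch below shows the two DataNode properties involved, set through the Configuration API purely for illustration; in practice they live in hdfs-site.xml, and the mount points here are invented. Each comma-separated directory is one JBOD disk.

    import org.apache.hadoop.conf.Configuration;

    public class JbodLayoutSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // One directory per physical disk (no RAID, no striping).
            conf.set("dfs.datanode.data.dir",
                    "/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn");

            // Let the DataNode keep running after a couple of disk failures
            // instead of taking the whole node (and all its replicas) offline.
            conf.setInt("dfs.datanode.failed.volumes.tolerated", 2);

            System.out.println(conf.get("dfs.datanode.data.dir"));
        }
    }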

CPU and Memory: Hadoop nodes need adequate CPU and memory for MapReduce jobs, not just storage.

Network: 10 Gigabit Ethernet is becoming standard. Hadoop generates significant network traffic during data replication and MapReduce shuffles.

The economics favor many cheap servers over fewer expensive ones.

I/O Patterns

Hadoop’s I/O patterns differ from traditional workloads:

Large Sequential Reads: MapReduce jobs scan large datasets sequentially.

Large Sequential Writes: Output from MapReduce jobs is written sequentially.

Append-Only: HDFS doesn’t support random writes or overwrites. Files are written once, read many times.

High Throughput Over Low Latency: Hadoop cares about aggregate throughput (gigabytes per second across the cluster), not the latency of any individual operation (milliseconds).

These patterns are very different from OLTP databases (small random I/O) or virtual desktop infrastructure (boot storms).
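
The sketch below, again with a hypothetical path, shows the only write pattern HDFS offers: open a stream, write sequentially, close, then stream the file back. There is no call for overwriting a byte range in place.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SequentialIoExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/demo/sequential.txt");   // hypothetical path

            // Write once, sequentially. Random writes and overwrites are not supported.
            try (FSDataOutputStream out = fs.create(file, true)) {
                for (int i = 0; i < 1000; i++) {
                    out.writeBytes("record-" + i + "\n");
                }
            }

            // Read it back as one large sequential stream, optimizing for throughput.
            byte[] buffer = new byte[1 << 20];                  // 1 MB read buffer
            long total = 0;
            try (FSDataInputStream in = fs.open(file)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
            }
            System.out.println("read " + total + " bytes");
            fs.close();
        }
    }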

Capacity Planning

Hadoop capacity planning has unique considerations:

Replication Factor: With 3x replication, you need 3x the raw capacity. 100 TB of data requires 300 TB of raw storage.

Intermediate Data: MapReduce creates intermediate data during processing. Plan for 25-50% overhead beyond input and output data.

Growth: Big data tends to grow rapidly. Plan for significant growth headroom.

Node Sizing: Balance capacity per node with cluster size. Too few large nodes limits parallelism; too many small nodes adds management and coordination overhead.

Typical Hadoop node: 12-24 TB raw capacity (12-24 x 1-2 TB disks).
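
A back-of-the-envelope sizing calculation, using the numbers above. The figures are placeholders to replace with your own, and the intermediate-data allowance is applied conservatively to the replicated total:

    public class CapacityPlanSketch {
        public static void main(String[] args) {
            double inputTb = 100.0;                 // expected data set size (TB)
            int replication = 3;                    // HDFS replication factor
            double intermediateOverhead = 0.40;     // allowance for MapReduce scratch data
            double nodeRawTb = 24.0;                // e.g. 12 x 2 TB disks per node

            double rawNeeded = inputTb * replication * (1 + intermediateOverhead);
            int nodes = (int) Math.ceil(rawNeeded / nodeRawTb);

            System.out.printf("raw capacity needed: %.0f TB -> %d nodes%n", rawNeeded, nodes);
            // 100 TB * 3 * 1.4 = 420 TB raw -> 18 nodes at 24 TB each, before growth headroom
        }
    }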

Network Considerations

Network is critical for Hadoop:

Rack-Aware Topology: HDFS understands network topology. Configure rack awareness so HDFS can place replicas intelligently.
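
One way to wire this up is sketched below with the Configuration API and an invented script path; in practice the property is set in core-site.xml. The script receives DataNode addresses as arguments and prints one rack path per address.

    import org.apache.hadoop.conf.Configuration;

    public class RackAwarenessSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Point Hadoop at an executable that maps node addresses to rack paths.
            // The script is invoked with IPs or hostnames and must print one
            // rack identifier (e.g. /dc1/rack12) per argument, in order.
            conf.set("net.topology.script.file.name", "/etc/hadoop/topology.sh"); // hypothetical path

            System.out.println(conf.get("net.topology.script.file.name"));
        }
    }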

Bandwidth: Ensure adequate bandwidth for data replication, MapReduce shuffles, and data loading.

Oversubscription: Some oversubscription is acceptable, but too much causes performance problems during shuffle phases.

10 GbE: Increasingly standard for Hadoop clusters to support high throughput.

Network is often a bottleneck in Hadoop clusters. Don’t under-provision it.

NameNode High Availability

The NameNode is a single point of coordination. If it fails, the cluster is down. HDFS High Availability addresses this:

Standby NameNode: A second NameNode runs in standby mode, ready to take over.

Shared Storage: Edit logs are written to shared storage (an NFS mount in early HA deployments, or a quorum of JournalNodes in newer releases) so the standby can take over seamlessly.

Automatic Failover: With ZooKeeper, failover can be automatic.

For production clusters, NameNode HA is essential. The NameNode is too critical to be a single point of failure.
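
From the client side, HA appears as a logical nameservice rather than a single NameNode hostname. A sketch of the client-facing properties follows; the nameservice and host names are invented, and in practice these settings live in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientConfigSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // A logical nameservice instead of one NameNode hostname.
            conf.set("fs.defaultFS", "hdfs://mycluster");                 // hypothetical nameservice
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

            // The client talks to the active NameNode and fails over to the standby.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            FileSystem fs = FileSystem.get(conf);
            System.out.println("connected to " + fs.getUri());
            fs.close();
        }
    }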

SAN vs. DAS for Hadoop

A common question: Can you use SAN storage for Hadoop instead of direct-attached storage (DAS)?

Arguments for SAN:

  • Leverage existing SAN infrastructure
  • Centralized storage management
  • Array-based data protection

Arguments Against SAN:

  • Loses data locality benefits
  • Network becomes bottleneck
  • Defeats Hadoop’s scale-out economics
  • HDFS replication already provides redundancy, making array-level RAID an unnecessary second layer of protection

My view: SAN defeats Hadoop’s architectural advantages. Use DAS for production Hadoop except in specific scenarios (very small clusters, proof-of-concept).

Compression and Encoding

Compression is important for Hadoop:

Reduces Storage: Compression ratios of 2-5x are common, significantly reducing storage requirements.

Improves Performance: Less data to read from disk and transfer over network often outweighs decompression CPU cost.

Codec Choice: Different codecs balance compression ratio vs. CPU cost. Snappy provides modest compression with low CPU. Gzip provides better compression with more CPU.

Splittability: Some compressed formats (like bzip2) can be split so multiple tasks process a single file in parallel; others (like plain gzip) cannot, forcing one task per file.

Plan compression strategy as part of storage design.
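
As a small example of codec usage, the sketch below wraps an HDFS output stream in the gzip codec before writing (the path is hypothetical; gzip is used because it ships with Hadoop and needs no native library):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    import java.io.OutputStream;

    public class CompressedWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
            Path file = new Path("/data/logs/day1.txt" + codec.getDefaultExtension()); // ends in .gz

            // Compress on the way in: less data hits the disks and the network.
            try (OutputStream out = codec.createOutputStream(fs.create(file, true))) {
                for (int i = 0; i < 100000; i++) {
                    out.write(("log line " + i + "\n").getBytes("UTF-8"));
                }
            }
            fs.close();
            // Note: a plain .gz file is not splittable; one task reads the whole file.
        }
    }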

Data Ingestion

Getting data into Hadoop efficiently:

Bulk Loading: Use tools like distcp for large dataset imports.

Streaming Ingestion: Tools like Flume for continuous data streams.

Network Bandwidth: Data ingestion consumes network bandwidth. Plan accordingly.

Replication Overhead: As data is written, it’s replicated to multiple nodes, multiplying network traffic.

For large datasets, data loading can take days. Plan ingestion carefully.
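
distcp, a MapReduce job that copies files in parallel, is the right tool at scale. For smaller loads, a single-process copy through the FileSystem API looks like the sketch below, with invented paths; either way, every byte written is then replicated, so the network carries roughly replication-factor times the ingest volume.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SimpleIngestExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            Path localFile = new Path("file:///var/log/app/events-2013-06-01.log"); // hypothetical source
            Path hdfsDir = new Path("/data/raw/events/");                           // hypothetical target

            fs.mkdirs(hdfsDir);
            // Copies through this one client; the DataNodes then replicate the
            // blocks among themselves, multiplying the network traffic.
            fs.copyFromLocalFile(localFile, hdfsDir);
            fs.close();
        }
    }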

Data Retention and Lifecycle

Hadoop stores massive amounts of data. Lifecycle management is essential:

Tiering: Move old data to cheaper storage (larger SATA drives, archival systems).

Deletion: Delete data that’s no longer needed. Storage isn’t infinite.

Compression: Compress older data more aggressively.

External Archives: Move very old data to tape or cloud storage.

Without lifecycle management, Hadoop clusters fill up quickly.
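
A minimal sketch of age-based cleanup, with an invented directory and retention window; real deployments usually drive this from a scheduler, but the underlying calls are just a listing, a timestamp check, and a delete:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RetentionSweepSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/data/raw/events");              // hypothetical directory
            long cutoff = System.currentTimeMillis() - 90L * 24 * 60 * 60 * 1000;  // 90 days

            for (FileStatus status : fs.listStatus(dir)) {
                if (status.isFile() && status.getModificationTime() < cutoff) {
                    System.out.println("deleting " + status.getPath());
                    fs.delete(status.getPath(), false);           // false: not recursive
                }
            }
            fs.close();
        }
    }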

Performance Tuning

Hadoop performance tuning considerations:

Block Size: Larger blocks (128 MB or 256 MB) reduce NameNode memory pressure but coarsen parallelism, since a block is typically the unit of work for a single map task.

Replication: Reducing to 2x replication saves storage but reduces reliability.

DataNode Disk Configuration: More spindles per node increases aggregate I/O.

Memory: Adequate memory for caching improves performance.

Network: Ensure network doesn’t bottleneck during shuffle phases.

Performance tuning requires understanding your specific workloads.
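
Block size and replication can be tuned per file at write time as well as cluster-wide. A sketch follows; the paths and values are illustrative, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-side defaults (normally set in hdfs-site.xml).
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);    // 256 MB blocks
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(conf);
            Path bulky = new Path("/data/scans/full-table.dat");  // hypothetical file

            // Per-file override: overwrite flag, buffer size, replication, block size.
            try (FSDataOutputStream out =
                         fs.create(bulky, true, 4 * 1024 * 1024, (short) 2, 512L * 1024 * 1024)) {
                out.writeBytes("...");                            // application data goes here
            }
            fs.close();
        }
    }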

Integration with Enterprise Storage

While Hadoop uses DAS, there are integration points with enterprise storage:

Data Loading: Move data from enterprise storage to Hadoop for analysis.

Results Export: Move results from Hadoop back to enterprise storage.

Backup: Back up critical Hadoop data to enterprise storage.

Archive: Archive old Hadoop data to enterprise storage or tape.

The enterprise and Hadoop storage ecosystems need to interoperate even if they’re architecturally different.

Cloud Considerations

Running Hadoop in the cloud:

Ephemeral Storage: Cloud instance storage may be ephemeral. Use object storage (S3) for persistent data.

Network Costs: Data transfer between nodes may have costs.

Elasticity: Add nodes during processing, remove afterward to save costs.

S3 vs. HDFS: Use S3 for permanent storage and HDFS on instance storage for the working set during processing.

Cloud changes Hadoop economics and architecture.
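
With the appropriate connector on the classpath (the S3A filesystem from the hadoop-aws module in newer releases; older clusters used S3N) and credentials configured, object storage is reached through the same FileSystem API. The bucket and paths below are invented:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.net.URI;

    public class S3aListingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials are usually supplied via fs.s3a.* properties,
            // environment variables, or instance roles; omitted here.

            FileSystem s3 = FileSystem.get(URI.create("s3a://my-analytics-bucket/"), conf); // hypothetical bucket
            for (FileStatus status : s3.listStatus(new Path("s3a://my-analytics-bucket/input/"))) {
                System.out.println(status.getPath() + " " + status.getLen());
            }
            s3.close();
        }
    }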

Monitoring and Management

Monitor Hadoop storage:

Capacity: Track used/available capacity per node and cluster-wide.

Disk Failures: Monitor for failed disks and replace promptly.

Under-Replicated Blocks: A growing count of under-replicated blocks indicates that re-replication hasn't caught up or that DataNodes are down.

NameNode Heap: NameNode memory usage grows with the number of files and blocks; watch heap headroom as the namespace grows.

Data Skew: Ensure data is distributed evenly across nodes.

Hadoop management tools provide visibility, but you need to watch the right metrics.
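
A minimal cluster-wide capacity check through the client API is sketched below; per-node detail and under-replicated block counts come from the NameNode web UI, metrics, or fsck rather than this call:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    public class CapacityCheckSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FsStatus status = fs.getStatus();   // aggregate numbers reported by the NameNode

            double tb = 1024.0 * 1024 * 1024 * 1024;
            System.out.printf("capacity: %.1f TB, used: %.1f TB, remaining: %.1f TB%n",
                    status.getCapacity() / tb, status.getUsed() / tb, status.getRemaining() / tb);
            fs.close();
        }
    }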

When Hadoop Storage Makes Sense

Hadoop storage is optimal for:

Large-Scale Analytics: Processing petabytes of data.

Append-Only Workloads: Log analysis, clickstream data, sensor data.

Cost-Sensitive: When storage economics matter more than absolute performance.

Unstructured Data: Text, images, videos that don’t fit well in relational databases.

Hadoop storage doesn’t make sense for:

Transactional Workloads: OLTP databases need random write capability and low latency.

Small Data: Overhead isn’t worthwhile for gigabyte-scale datasets.

Real-Time: Hadoop is batch-oriented, not real-time.

Use the right storage for each workload.

Lessons for Traditional Storage

What can traditional storage learn from Hadoop?

Embrace Failure: Design for failure rather than trying to prevent all failures.

Software Intelligence: Sophisticated software can make commodity hardware reliable.

Scale-Out: Horizontal scaling has better economics than vertical scaling.

Data Locality: Co-locating compute and storage reduces data movement.

These principles are influencing enterprise storage design.

Conclusion

Hadoop represents a fundamentally different approach to storage. Rather than preventing failures with expensive hardware, it accepts failures and handles them in software. Rather than centralized storage, it distributes data across many nodes.

This approach works well for specific use cases—large-scale analytics on append-only data—but isn’t universal. Understanding when Hadoop storage makes sense vs. when traditional storage is better is important.

Working at Cisco on traditional storage, I find Hadoop fascinating. The principles—software-defined reliability, scale-out architecture, data locality—are influencing all storage technology.

As big data grows in importance, understanding Hadoop and its storage model becomes essential for storage professionals. You may not deploy Hadoop yourself, but its architectural patterns will influence the storage technologies you do use.

The storage world is diverse enough for multiple paradigms. The future isn’t Hadoop replacing traditional storage or vice versa—it’s both coexisting, each used where its strengths align with workload requirements.