Accelerating Big Data Analytics with Flash Caching

The global volume, velocity and variety of data are all increasing, and these three dimensions of the data deluge (the massive growth of digital information) are what make Hadoop software well suited to big data analytics. Hadoop is purpose-built for analyzing a variety of structured and unstructured data, but its biggest advantage is its ability to cost-effectively analyze an unprecedented volume of data on clusters of commodity servers.

While Hadoop is built around the ability to linearly scale and distribute MapReduce jobs across a cluster, there is now a more cost-effective option for scaling performance in Hadoop clusters: high-performance read/write PCIe flash cache acceleration cards.

 

Scaling Hadoop Performance: A Historical Perspective

 

The closer the data is to the processor, the lower the latency and the better the performance. This fundamental principle of data proximity has guided the Hadoop architecture, and is the main reason for Hadoop’s success as a high-performance big data analytics solution.

To keep the data close to the processor, Hadoop uses servers with direct-attached storage (DAS). And to get the data even closer to the processor, the servers are usually equipped with significant amounts of random access memory (RAM).

Small portions of a MapReduce job are distributed across multiple nodes in a cluster for processing in parallel, giving Hadoop its linear scalability. Depending on the nature of the MapReduce jobs, bottlenecks can form either in the network or in the individual server nodes. These bottlenecks can often be eliminated by adding more servers, more processor cores, or more RAM.

With MapReduce jobs, a server’s maximum performance is usually determined by its maximum RAM capacity. This is particularly true during the Reduce phase, when intermediate data shuffles, sorts and merges exceed the server RAM size, forcing the processing to be performed with input/output (I/O) to hard disk drives (HDDs).

As the need for I/O to disk increases, performance degrades considerably. HDD latency is rooted in the mechanics of spinning platters and moving heads, and every additional trip to disk imposes a severe performance penalty.
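The spill-to-disk behavior described above can be illustrated with a minimal external merge sort. This is only a sketch of the general technique, not Hadoop's actual shuffle and sort implementation; the function name `external_sort` and the `max_in_memory` budget are hypothetical, and records are assumed to be strings without embedded newlines.

```python
import heapq
import tempfile


def read_run(run_file):
    """Stream one sorted run back from disk, one record at a time."""
    for line in run_file:
        yield line.rstrip("\n")


def external_sort(records, max_in_memory):
    """Sort records while holding at most max_in_memory of them in RAM.

    Returns (sorted_records, number_of_runs_spilled_to_disk).
    """
    run_files = []
    buffer = []

    def spill():
        # Sort the in-memory buffer and write it to disk as one sorted run.
        buffer.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(record + "\n" for record in buffer)
        run.seek(0)
        run_files.append(run)
        buffer.clear()

    for record in records:
        buffer.append(record)
        if len(buffer) >= max_in_memory:
            spill()  # memory budget exhausted: this is the slow disk path

    if not run_files:
        # Everything fit in memory, so no disk I/O was needed.
        return sorted(buffer), 0

    if buffer:
        spill()

    # Merge the sorted runs by streaming them back from disk.
    merged = list(heapq.merge(*(read_run(run) for run in run_files)))
    for run in run_files:
        run.close()
    return merged, len(run_files)
```

When the input fits within the budget, the function never touches disk. Once it does not, every record is written out and read back at least once, which is the I/O that a faster storage tier would accelerate.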

One cost-effective way to break through the disk I/O bottleneck and further scale the performance of the Hadoop cluster is to use solid state flash memory for caching.

 

Scaling Hadoop Performance with Flash Caching

 

Data has been cached from slower to faster media since the advent of the mainframe computer, and caching remains an essential function in every computer today. Its long and widespread use demonstrates its ability to deliver substantial and cost-effective performance improvements.

When a server is equipped with its full complement of RAM and that memory is fully utilized by applications, the only way to increase caching capacity is to add a different type of memory. One option is NAND flash memory, which is up to 200 times faster than a high-performance HDD.
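A rough way to reason about the benefit is the standard effective-access-time model: average latency is the hit-rate-weighted mix of cache latency and backing-store latency. The sketch below assumes the "up to 200 times faster" figure applies to access latency. The function name, the 10 ms HDD latency and the 90 percent hit rate are hypothetical values chosen for illustration, not measurements from the article.

```python
def effective_latency_ms(hit_rate, hdd_latency_ms, flash_speedup=200.0):
    """Average access latency for an HDD fronted by a flash cache.

    hit_rate       -- fraction of accesses served from flash (0.0 to 1.0)
    hdd_latency_ms -- assumed latency of one HDD access, in milliseconds
    flash_speedup  -- how many times faster flash is than the HDD
                      (200 is the upper bound quoted in the text)
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    flash_latency_ms = hdd_latency_ms / flash_speedup
    return hit_rate * flash_latency_ms + (1.0 - hit_rate) * hdd_latency_ms


# Hypothetical example: 10 ms HDD access, 90 percent of accesses hit flash.
print(f"{effective_latency_ms(0.9, 10.0):.3f} ms")
```

Under these assumed numbers, the average latency falls from 10 ms to roughly 1 ms. The remaining cost is dominated by cache misses, so the hit rate matters more than the raw speed of the flash.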

A new class of server-side PCIe flash solution uniquely integrates onboard flash memory with Serial-Attached SCSI (SAS) interfaces to create high-performance DAS configurations consisting of a mix of solid state and hard disk drive storage, coupling the performance benefits of flash with the capacity and cost advantages of HDDs.

 

Testing Cluster Performance With and Without Flash Caching

 

To compare cluster performance with and without flash caching, LSI used the widely accepted TeraSort benchmark. TeraSort tests performance in applications that sort large numbers of 100-byte records, which requires a considerable amount of computation, networking and storage I/O—all characteristics of real-world Hadoop workloads.

LSI used an eight-node cluster for its 100-gigabyte (GB) TeraSort test. Each server was equipped with 12 CPU cores, 64 GB of RAM and eight 1-terabyte HDDs, as well as an LSI® Nytro MegaRAID 8100-4i acceleration card combining 100 GB of onboard flash memory with intelligent caching software and LSI dual-core RAID-on-Chip (ROC) technology. For the test without caching, the acceleration card’s onboard flash memory was deactivated.

No software change was required because the flash caching is transparent to the server applications, operating system, file subsystem and device drivers. Notably, RAID (Redundant Array of Independent Disks) storage is not normally used in Hadoop clusters because the Hadoop Distributed File System already replicates data among nodes. So while the RAID capability of the Nytro MegaRAID acceleration card would not be used in all Hadoop clusters, this feature adds little to the overall cost of the card.

LSI internal testing with flash caching activated found that the TeraSort test consistently completed approximately 33 percent faster. The benefit scales with cluster size: the larger the cluster needed to complete a specific MapReduce or other job within a required run time, the more servers caching can save.
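As a back-of-the-envelope sketch, the cluster-size effect can be estimated by dividing the baseline server count by the per-server speedup. This assumes perfectly linear scaling and reads "approximately 33 percent faster" as 4/3 the throughput per server; that reading is an interpretation, chosen because it reproduces the 750-server configuration used in the cost comparison. The function name `servers_needed` is hypothetical.

```python
import math
from fractions import Fraction


def servers_needed(baseline_servers, speedup):
    """Servers required to match the baseline cluster's run time.

    Assumes linear scaling: each accelerated server does `speedup`
    times the work of a baseline server in the same amount of time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Exact rational arithmetic avoids float rounding pushing the
    # result up by one server when the division is exact.
    return math.ceil(Fraction(baseline_servers) / Fraction(speedup))


# 1000 baseline servers, each accelerated server 4/3 as fast (assumed).
print(servers_needed(1000, Fraction(4, 3)))  # 750
```

Real clusters rarely scale perfectly linearly, so the result is best treated as an optimistic estimate of the achievable reduction.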

Using the TeraSort benchmark, a cluster with LSI Nytro MegaRAID cards completed Hadoop jobs 33 percent faster (LSI internal test; individual results may vary).

 

Saving Cash with Cache

 

Based on results from the internal LSI TeraSort benchmark performance test, the table below compares the estimated total cost of ownership (TCO) of two cluster configurations—one with and one without flash caching—that are both capable of completing the same job in the same amount of time.

 

| | Without Caching | With Caching |
|---|---|---|
| Number of Servers | 1,000 | 750 |
| Servers (MSRP of $6,280) | $6,280,000 | $4,710,000 |
| Nytro MegaRAID Cards (MSRP of $1,799) | $0 | $1,349,250 |
| Total Hardware Costs | $6,280,000 | $6,059,250 |
| Costs for Rack Space, Power, Cooling and Administration Over 3 Years * | $19,610,000 | $14,707,500 |
| 3-Year Total Cost of Ownership | $25,890,000 | $20,766,750 |

* Cost computed using data from the Uptime Institute, an independent division of The 451 Group.

The tests showed that in certain circumstances, using fewer servers to accommodate the same processing time requirement can reduce TCO by up to 20 percent, or $5.1 million, over three years.
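The arithmetic behind that comparison can be checked with a few lines of Python. All dollar figures and server counts are taken from the cost comparison above; the function and variable names are illustrative only.

```python
SERVER_MSRP = 6_280  # per server, from the cost comparison
CARD_MSRP = 1_799    # per Nytro MegaRAID card, from the cost comparison


def three_year_tco(servers, cards, operating_cost):
    """Hardware cost plus 3-year rack space, power, cooling and admin cost."""
    hardware = servers * SERVER_MSRP + cards * CARD_MSRP
    return hardware + operating_cost


tco_without = three_year_tco(servers=1000, cards=0, operating_cost=19_610_000)
tco_with = three_year_tco(servers=750, cards=750, operating_cost=14_707_500)
savings = tco_without - tco_with

print(f"Without caching: ${tco_without:,}")  # $25,890,000
print(f"With caching:    ${tco_with:,}")     # $20,766,750
print(f"Savings:         ${savings:,}")      # $5,123,250
print(f"Reduction:       {savings / tco_without:.1%}")  # 19.8%
```

The hardware totals are close ($6,280,000 versus $6,059,250), so nearly all of the savings come from operating 250 fewer servers over three years.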

 

Conclusion

 

Organizations using big data analytics now have another option for scaling performance: PCIe flash cache acceleration cards. While these tests centered on Hadoop clusters, LSI’s extensive internal testing with various databases and other popular applications consistently demonstrates performance gains ranging from a factor of three (for DAS configurations) to a factor of 30 (for SAN and NAS configurations).

Big data is only as useful as the analytics that organizations use to unlock its full value, which makes Hadoop a powerful tool for gaining deeper insights in science, research, government and business. Servers need to be smarter and more efficient, and flash caching enables fewer servers (with fewer software licenses) to perform more work, more cost-effectively, on data sets large and small. That makes it a strong option for IT managers working to do more with less under the growing pressure of the data deluge.

About the Author

 

Kimberly Leyenaar is a Principal Big Data Engineer and Solution Technologist for LSI’s Accelerated Storage Division. An Electrical Engineering graduate from the University of Central Florida, she has been a storage performance engineer and architect for over 14 years. At LSI, she now focuses on discovering innovative ways to solve the challenges surrounding Big Data applications and architectures.