Kudu: How Cloudera wants to save Hadoop by killing it


The massive drop in memory prices that is leading Hadoop adopters to abandon the disk-oriented MapReduce has finally caught up with the framework's storage component as well, thanks to the introduction of an alternative from none other than Cloudera Inc., its prime distributor. The move signals the beginning of the end for the decade-old project in its present form.

Like MapReduce, the Hadoop File System was created at a time when the most viable option for processing large amounts of unstructured records was storing the data on disk and slowly bringing small pieces into memory for analysis. The community has worked to adapt the framework as the underlying economics shifted over the years, but to limited effect.

The Hadoop File System has thus become a bottleneck for the growing number of organizations turning to Spark in hopes of exploiting the large amount of affordable memory suddenly at their disposal to remove the overhead involved in shuffling data to and from disk. That is proving detrimental to Hadoop as a whole, with a recent study finding that standalone deployments of Spark are quickly becoming the norm.

The soon-to-launch Kudu is Cloudera’s attempt to reverse that trend. It’s the product of a development effort spanning more than three years that began when its engineers realized that the changes needed to address the shift in infrastructure composition were too great to implement in the Hadoop File System or the complementary HBase database.

The result is a columnar store that combines the best qualities of both to provide what is touted as a unified platform for supporting the machine learning and predictive analytics workloads that organizations are running on Spark. Kudu exploits the abundance of memory in modern analytics clusters to make large parts of the information inside, including the metadata, instantly available for modification.
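To illustrate why a columnar layout suits analytics workloads, here is a toy sketch (not Kudu's actual design; the `ColumnarTable` class and its methods are invented for illustration). Storing values column by column means a scan reads only the columns a query needs, instead of pulling in every field of every row:

```python
# Illustrative sketch only: a toy in-memory columnar table.
# Not Kudu's implementation; names are hypothetical.

class ColumnarTable:
    def __init__(self, columns):
        # one list per column name, rather than one record per row
        self.data = {name: [] for name in columns}

    def insert(self, row):
        # an incoming row is split across the per-column lists
        for name, value in row.items():
            self.data[name].append(value)

    def scan(self, column):
        # an analytic query touches only the requested column
        return self.data[column]

t = ColumnarTable(["user", "clicks"])
t.insert({"user": "a", "clicks": 3})
t.insert({"user": "b", "clicks": 7})
print(sum(t.scan("clicks")))  # 10
```

The point of the sketch is the access pattern: summing `clicks` never touches the `user` column at all, which is the property columnar stores exploit for scan-heavy analytics.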

The cached changes are periodically flushed to disk in a single efficient batch, which incurs less overhead than many small write operations. Like the Hadoop File System, Kudu distributes the work across all the machines in a cluster and designates a master node to keep everything coordinated.
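The batching idea described above can be sketched in a few lines; this is a hypothetical illustration of the general technique, not Kudu's code, and the `BatchedStore` class, its `flush_threshold` parameter, and the log-file format are all assumptions made for the example. Updates accumulate in memory and are written out as one sequential append once a threshold is reached:

```python
# Hypothetical sketch of buffering changes in memory and writing
# them to disk in one batch. Not Kudu's implementation.
import json
import os
import tempfile

class BatchedStore:
    def __init__(self, path, flush_threshold=4):
        self.path = path
        self.flush_threshold = flush_threshold
        self.buffer = []  # cached changes awaiting a flush

    def write(self, record):
        # writes land in memory first, not on disk
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # one sequential append covers the whole batch
        if not self.buffer:
            return
        with open(self.path, "a") as f:
            for record in self.buffer:
                f.write(json.dumps(record) + "\n")
        self.buffer.clear()

path = os.path.join(tempfile.mkdtemp(), "store.log")
store = BatchedStore(path)
for i in range(5):
    store.write({"id": i})
store.flush()  # push out whatever remains in the buffer
```

The trade-off the sketch captures is the one the article describes: a single large sequential write amortizes per-operation overhead that many small writes would each pay in full.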

Users will eventually be able to spread that latter duty among multiple servers for greater reliability, similarly to how data ingestion is currently handled, one of the many features that Cloudera has in the pipeline for Kudu. On one hand, that's encouraging for organizations that may be considering jumping aboard the bandwagon, but on the other, it's also a measure of the technology's present immaturity.

The roadmap for Kudu is long and difficult, not only in a technical sense but from a more strategic perspective as well. Cloudera's effort to merge the capabilities of the Hadoop File System and HBase into a single platform highlights a broader consolidation of the project that is best reflected by Spark, which can substitute native add-ons for many of the disparate components in current distributions, leaving fewer opportunities for vendors to add value.

That will make it harder for Cloudera and its peers to remain competitive as the engine and its specialty components in particular continue to gain steam. As a result, the consolidation that is occurring in the upstream ecosystem today may very well end up spilling over to the vendors trying to commercialize it tomorrow. “The core message about the Hadoop ecosystem getting hollowed out by Spark is the single biggest trend going on in big data right now,” commented Wikibon’s George Gilbert.

Photo via Skeeze