All Is Not Kumbaya In the Hadoop Open Source Community


When you think “open source community,” do you think of a tightly knit group of developers linked arm-in-arm dedicated to building and promoting an open standards-based technology?

Well, that scenario may be common in the early days of any given open source project, but internal rivalries are sure to develop sooner or later over the direction of the project, over who claims credit for its advancement, and over who makes the most money via commercial support.

Don’t look now, but the simmering, behind-the-scenes rivalry between Cloudera and Hortonworks is out in the open. The two are in the midst of a war of words over which has contributed the most to Apache Hadoop.

It started with a recent Hortonworks blog post called The Yahoo Effect. In it, Hortonworks CEO Eric Baldeschwieler said the company was committed to continuing its contributions to Apache Hadoop and highlighted past contributions by Yahoo and (by extension) Hortonworks:

Source: Hortonworks 2011

There continues to be a very incorrect assumption by some in the market, however, that Yahoo! will no longer be a major contributor to Apache Hadoop moving forward. Nothing could be further from the truth. In fact, if you look at the diagram below, even if you exclude all of the code contributions made by the former Yahoo! team now at Hortonworks, Yahoo! is still the largest contributor to Apache Hadoop.

Hold on there, says Cloudera. In a rebuttal post, Cloudera CEO Mike Olson wrote:

With no disrespect to Yahoo!, however, the monolithic wall of green in Figure 1 tells a misleading story about the past, present and future of Apache Hadoop.

Source: Cloudera 2011

It’s absolutely correct to note that Yahoo! covered the salaries of contributors in the early years. Five years is an eternity in the tech industry, however, and many of those developers moved on from Yahoo! between 2006 and 2011. If you look at where individual contributors work today — at the organizations that pay them, and at the different places in the industry where they have carried their expertise and their knowledge of Hadoop — the story is much more interesting.

Back to Hortonworks. In a rebuttal to Cloudera’s rebuttal, Baldeschwieler says that comparing patch counts is misleading, as “patches differ in their investment of time and effort … We strongly believe that the lines of code contributed is a significantly more relevant metric.”

Source: Hortonworks 2011

In other words, Hortonworks/Yahoo employees may have contributed fewer patches overall, but their patches were more complex and contained significantly more lines of code than the average Cloudera patch, which by this measure makes Hortonworks/Yahoo the biggest contributor to Apache Hadoop.


So who’s right? And, ultimately, does it even matter who contributed the most code and/or patches to Apache Hadoop? I’m going to hedge here and say yes … and no.

Yes, at a macro level it matters to potential customers that the vendor they choose to support their Hadoop deployments with add-ons and services has contributed significantly to the underlying code. Who better understands Hadoop and the services required to support enterprise deployments than those who played a significant role in creating it? So goes the thinking. So I understand why Cloudera and Hortonworks are both eager to claim the mantle of “Biggest Hadoop Contributor.”

However, when it comes down to a bake-off, with a potential customer testing Cloudera’s distribution and services against Hortonworks’ competing offerings, which vendor contributed the most code or patches to Hadoop will be just one of the deciding factors. In fact, since the two are far and away the biggest contributors to the project, their claims may simply cancel each other out. The bigger factors are going to be who provides the more stable, easier-to-manage Hadoop cluster and the services necessary to turn the resulting analysis into direct business value.

Still, the spat makes for fun reading.