The Better Model for Hadoop: Open Source or Proprietary Approach?


There’s a debate going on inside the Hadoop community. On one side are open source purists who believe that data infrastructure software should be free and open with revenue generated solely from services. On the other side of the debate are those who feel the only viable long-term business model is selling proprietary software built on top of an open core.

What’s the better model for Hadoop: Open source or proprietary approach?

The two leading examples of the competing sides of the Hadoop debate are Hortonworks and Cloudera. If you’re not sure which company takes which side of the debate, consider these recent comments from Cloudera’s Chief Strategy Officer (and until recently CEO) Mike Olson. Writing on LinkedIn, Olson says, “You can’t build a successful stand-alone company purely on open source … Pure-play open source companies never survive. That’s a law of nature.”

Though he doesn’t call out Hortonworks by name, nobody doubts to whom Olson was referring in the post. Hortonworks, which was spun out of Yahoo’s Hadoop team in 2011, takes a purely open source approach. The company’s Hadoop distribution, Hortonworks Data Platform (HDP), is 100% Apache open source and anyone can download and use it free of charge. HDP includes Apache Ambari, open source Hadoop management software for cluster provisioning and integration with other enterprise management software. Hortonworks makes its money strictly from technical support subscriptions and other services such as training and education.

Cloudera, the first commercial Hadoop vendor on the market and current leader in number of enterprise deployments, uses a proprietary overlay model. For Cloudera, this means offering an open source Hadoop core – HDFS, HBase, Flume, Sqoop, etc. – for free and selling proprietary management and monitoring tools that provide functionality such as LDAP integration, operational metrics reporting and access management controls. Cloudera also offers paid-for technical support and training services.

As I see it, each approach has its pros and cons. First the open source approach (and this is strictly from a vendor, not customer, perspective):


Pros:

1. Vendors with a pure open source approach can take full advantage of community innovation across all aspects of product development.

2. It is easier and faster to “get a foot in the door” with potential customers because the barrier to entry is as low as it gets: the fully functioning software is free.

3. A pure open source approach makes it easier for vendors to integrate and closely partner with established channel partners that can open up significant market opportunities.


Cons:

1. In pure open source scenarios, it often takes longer for significant revenue generation to ramp up as customers wait to engage sizeable support services until deployments are ready to move into large-scale production.

2. Open source communities are sometimes chaotic and difficult to get moving in unison, making product roadmap planning more challenging.

3. Services-only business models are manpower intensive, making them more difficult to scale than traditional software sales models.

As for the proprietary overlay approach:


Pros:

1. Assuming it has smart people, a vendor taking the proprietary overlay/open core approach can potentially create value differentiation faster than open source competitors relying in part on the sometimes-chaotic open source community.

2. Selling software licenses on top of an open core is today a fairly conventional and well-understood business model, making it an easier “sell” to enterprise CIOs.

3. An open core still allows a vendor taking the proprietary overlay approach to maintain some level of credibility in the open source community.


Cons:

1. Any proprietary IP created by vendors in markets with active open source communities loses its value as soon as the community (or competitors) catches up in terms of functionality.

2. Depending on the market, proprietary components may compete with incumbent vendors and products, making it more difficult to integrate and partner.

3. Vendors taking the proprietary overlay approach risk being seen as taking advantage of the open source community if they fail to contribute significant code back to the core open source projects (and even then, community backlash is possible).

Already we are seeing the results of both models starting to play out. For Cloudera, this translates to an early lead in customers and revenue generation. But the company has taken some criticism for being less than easy to partner with, and projects like Ambari are quickly moving towards parity with Cloudera’s various proprietary components.

Hortonworks, for its part, has already established key reseller arrangements with some of the biggest names in enterprise tech – including Microsoft, SAP and Teradata – as part of its strategy to win the long game. But the long game takes time, and in the meantime the company risks ceding too much ground to Cloudera.

So why does this competition matter? It matters because for Hadoop to reach its full potential – making all manner and volume of data available for analysis – there must be at least one successful commercial entity supporting the framework that enterprise customers can confidently count on to be around for the long haul (whether as a standalone vendor or as part of a larger company). Such a vendor (or vendors) plays a number of roles, most importantly (1) making Hadoop consumable by packaging the framework’s disparate components into an easily downloadable distribution and (2) developing enterprise-grade features for Hadoop and providing enterprise-level support for it. Without a viable commercial supporter, it is unlikely any open source technology could achieve widespread enterprise adoption. Hadoop included.

It’s too early to tell which model will prevail, or even if there is room in the Hadoop market for two “winners.” But there is one major point upon which Cloudera and Hortonworks agree. Open source software has proven that it can significantly disrupt entrenched and highly lucrative markets. Case in point, of course, is Linux, which knocked Microsoft from its lofty perch as the dominant operating system in corporate data centers. Hadoop has the potential to have a similar impact on the relational database market, currently worth roughly $24 billion.

Editor’s Note: Join us this Thursday, October 10th at 3pm ET for a live CrowdChat to debate this hot topic. The CrowdChat is the first in a series of three leading up to theCUBE at #BigDataNYC event on October 29 and 30.