UPDATED 16:39 EDT / DECEMBER 06 2011


How Many Proprietary, Value-Add Components Will the Hadoop Community Accept?

Last week, Hortonworks announced that its engineers, with the help of the Apache Hadoop community, have developed a new REST-based protocol called WebHDFS that enables a wider range of applications to access the Hadoop Distributed File System. WebHDFS, which is Apache-compatible, is designed to make it easier for applications sitting outside Hadoop clusters to connect to HDFS.

According to Hortonworks:

WebHDFS supports all HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming. In contrast, HFTP (a previous version of HTTP protocol heavily used at Yahoo!) only supports the read operations but not the write operations. Read operations are the operations that do not change the status of HDFS including the namespace tree and the file contents.

With WebHDFS in place, Unix/Linux utilities and non-Java applications can connect with HDFS, according to Hortonworks.
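To make the protocol concrete, here is a minimal sketch in Python of what WebHDFS access from outside the cluster looks like, using only the standard library. The NameNode host is hypothetical, 50070 was the default NameNode HTTP port at the time, and the operations follow the WebHDFS REST API: OPEN is a read operation, while MKDIRS is one of the write operations HFTP could not perform.

```python
# Minimal sketch of WebHDFS access from a non-Java client, using only the
# Python standard library. The NameNode host below is hypothetical;
# 50070 was the default NameNode HTTP port in Hadoop releases of this era.
import json
import urllib.request

NAMENODE = "http://namenode.example.com:50070"  # hypothetical host

def read_file(path):
    # op=OPEN is a read operation: the NameNode redirects to a DataNode,
    # urllib follows the redirect, and the file contents stream back.
    url = f"{NAMENODE}/webhdfs/v1{path}?op=OPEN"
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def make_directory(path, user="hdfs"):
    # op=MKDIRS is a write operation -- the kind HFTP could not perform.
    url = f"{NAMENODE}/webhdfs/v1{path}?op=MKDIRS&user.name={user}"
    req = urllib.request.Request(url, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # e.g. {"boolean": true} on success

if __name__ == "__main__":
    print(make_directory("/user/hdfs/reports"))
    print(read_file("/user/hdfs/reports/summary.txt").decode())
```

Because everything is plain HTTP, the same calls work from curl, shell scripts, or any language with an HTTP client, which is exactly the wider reach Hortonworks is touting.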

Meanwhile, MapR today released v1.2 of its Hadoop distribution. According to MapR, the latest iteration includes performance improvements to MapR’s versions of HBase, Hive and Pig; extended support for C/C++ API access; added support for Windows and Mac clients; and a new offering of MapR running on a VMware virtual machine.

Both announcements, along with Cloudera’s contributions to the latest Apache ZooKeeper release, will benefit end-users and further illustrate that Hadoop is ready for prime time.

Hadoop Ideological Battle Still Raging

But the distribution updates also illustrate the continuing ideological battle taking place in the Hadoop community. The overriding question, which has yet to be answered, is how many proprietary, value-add components the Hadoop community and user base are willing to accept from distribution vendors like EMC/MapR and, to a lesser extent, Cloudera.

Hortonworks is betting none. It posits that enterprises that adopt Hadoop want a purely open-source distribution to, among other things, ensure backwards compatibility and avoid lock-in risk. But Hortonworks customers must then rely on the Hadoop community to innovate quickly enough to catch up to any proprietary performance improvements made by EMC/MapR and the like. Hortonworks plans to differentiate itself on its technical support offerings, which Co-Founder Arun Murthy says the team honed at Yahoo!

Cloudera is taking a similar approach to the Hadoop distribution itself, but believes enterprises are willing to pay for Cloudera’s proprietary management console, which, Cloudera’s Amr Awadallah claims, simply allows Cloudera to provide support faster and does not represent a lock-in risk for customers.



MapR, with its proprietary NFS layer, which the company claims improves upon HDFS performance by an order of magnitude, is taking a more closed-source, traditional software approach, though its distribution does include some open source Apache Hadoop components.
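To see why that matters to customers, consider what NFS access buys: once a MapR cluster is exported over NFS and mounted on a client machine, cluster data behaves like ordinary files, so existing tools and scripts need no Hadoop client libraries at all. A minimal sketch, assuming a hypothetical mount point following MapR’s /mapr/&lt;cluster-name&gt; convention:

```python
# Minimal sketch: with the cluster mounted over NFS (the /mapr/... mount
# point below is hypothetical), cluster data is read and written with
# plain file I/O -- no Hadoop-specific client code involved.

LOG = "/mapr/my.cluster.com/user/analyst/events.log"  # hypothetical path

# Append a record to a file that lives on the cluster.
with open(LOG, "a") as f:
    f.write("2011-12-06\tpage_view\t/pricing\n")

# Read it back the same way any local file would be read.
with open(LOG) as f:
    for line in f:
        print(line.rstrip())
```

Stock HDFS of this era was essentially write-once, so random read/write access through unmodified standard tools was a genuine differentiator, and exactly the sort of proprietary value-add this debate is about.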

Tresata’s Abhi Mehta, whose firm provides Big-Data-as-a-Service based on Hadoop for banks and financial firms, told me that most enterprise customers he talks to are less inclined to pay for a Hadoop distribution and prefer to invest their money and effort into Big Data applications that deliver real business value. I largely agree with that analysis, but I think there is some room in the market for a partially proprietary, value-add Hadoop distribution.

Specifically, I believe more “traditional” enterprises that are uncomfortable deploying open source technology (as limited as that view is, IMO) are likely candidates to embrace a partially closed-source distribution. Companies with a pressing business initiative that proprietary components like MapR’s NFS can help them realize in the very short term (as opposed to waiting for the open source community to reach parity with any given Hadoop component) are also good targets for partially proprietary Hadoop approaches.

In the long term, however, I don’t think the Hadoop market can sustain three competing distributions, and either Hortonworks’ Apache-compatible HDP or Cloudera’s largely Apache-compatible CDH will emerge as the dominant Hadoop distribution. The runner-up of those two, or EMC/MapR, will likely settle in as a profitable but significantly smaller number two, while the third vendor will likely have to look for a new line of work.

Whichever vendor comes out on top, it is in the best interest of end-users that this battle come to a close sooner rather than later. The real value in Hadoop lies in the applications that allow enterprises to solve real business problems, innovate, and create new products and businesses. The more time enterprises spend evaluating competing Hadoop distributions, the less time they spend implementing game-changing Big Data analytics and applications. That said, with stakes this high, I expect the Hadoop distribution battle to continue apace through at least the first half of 2012.

