The Data Economy: Understanding the Hadoop-Data Warehouse Balance of Power

Will Hadoop replace your enterprise data warehouse (EDW)?

This question, or some variation thereof, has been making the rounds lately. Just this week I’ve read two good posts on the topic (this one from Matt Asay and this one from Timo Elliot), and my Twitter feed is full of related commentary.

The answer to this question has significant ramifications for data warehouse vendors and the $10 billion-plus EDW market, so it’s not surprising it’s getting so much attention. So what’s the answer?

Well, it depends on what you mean by “replace.” Sorry for the nuance, but nuance is required in this case.

Wikibon agrees with Asay, Elliot and others that Hadoop is not going to outright replace your EDW. The EDW is a mature technology that supports many mission-critical workloads related to business intelligence reporting. Many executives and managers rely on these reports to run their businesses. Hadoop is not capable of supporting many of these mission-critical workloads with the levels of performance, reliability, security or usability required.

However, Hadoop is capable of supporting some non-mission-critical (but often storage- and/or compute-intensive) EDW workloads, and it does so at a fraction of the cost. The most obvious of these workloads is data transformation, but there are others. Enterprise practitioners are already beginning to shift these workloads from the EDW to Hadoop, resulting in lower costs and better-performing data warehouses.

Matt Brandwein, Director of Product Marketing at Cloudera, gave a great example during a recent webinar (which is definitely worth watching in full). He cited the case of one company that discovered 5% of workloads in its EDW were consuming 60% of EDW compute resources. The company shifted these workloads (in this case ETL jobs) to Hadoop, saving money and freeing up CPU in the EDW for higher-value workloads.

(Of course, Hadoop is also a great platform for a number of other workloads that aren’t possible with conventional EDW technology, including large-scale exploratory analytics and crunching unstructured and multi-structured data.)

The answer to the original question, then, is that Hadoop will replace the EDW for specific workloads, but not the EDW itself. This means data warehouse vendors now face competition from commercial Hadoop vendors for some of the same dollars related to these overlapping workloads, and growth rates for data warehouse vendors are likely to slow if not stagnate. But it’s not a zero-sum game, as Asay points out, nor is Hadoop an existential threat to EDW vendors.

In fact, all that new data created in Hadoop could make its way to the EDW eventually, actually resulting in more data under management for EDWs (and more revenue for EDW vendors). And EDW vendors are introducing new capabilities that allow better integration with Hadoop but also solidify the EDW as the dominant platform in the relationship between the two (see Teradata’s recent QueryGrid release).

But that’s not the only possibility. As the open source community and Hadoop vendors add more analytic capabilities to Hadoop and improve enterprise-grade features and performance, this paradigm could be flipped on its head. Eventually, Hadoop could support more valuable workloads, such as advanced analytics and business intelligence, than the EDW. In this scenario, Hadoop serves as the dominant data management platform in the enterprise, with the EDW serving as a tactical adjunct tool for important but less valuable tasks.

Consider the improvements made to Hadoop over just the last year. In April 2013, Cloudera introduced Impala, enabling SQL-like capabilities on Hadoop. In the fall, Hortonworks (with significant contributions from the open source community) debuted YARN, or Yet Another Resource Negotiator, transforming Hadoop from a one-trick pony (MapReduce) to a multi-application framework. And just today, MapR announced it has added Apache Spark to its enterprise distribution, bringing in-memory processing to Hadoop.

As more analytic capabilities such as these are developed and security, data governance and reliability improve, it’s not inconceivable that the balance of power between Hadoop and the EDW could shift.

Today, in all but the most sophisticated enterprises, the EDW performs the vast majority of high-value workloads, while Hadoop covers many important but less glamorous tasks. Eventually, though, the EDW could complement Hadoop, rather than the other way around.