UPDATED 14:11 EDT / SEPTEMBER 09 2015

NEWS

Wikibon analyst sees urgent need to simplify Hadoop ecosystem

What do Sqoop, Flume, Kafka, Spark Streaming, Flink, Hortonworks DataFlow, Samza, DataTorrent, and Storm have in common? They can all be used to deliver data in batch or streaming modes to Hadoop clusters. The difficulty for enterprise IT organizations is deciding which ones to use, not to mention finding the skills necessary to operate them.
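To make that overlap concrete, here is a minimal sketch of one such ingest path: a Spark Streaming job (Spark 1.x DStream API) that pulls messages from a Kafka topic and lands each micro-batch in HDFS, where batch tools can pick them up later. The topic name, broker address, and output path are illustrative assumptions, not details from the report.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x Kafka integration

# Pull events from a Kafka topic and persist each micro-batch to HDFS,
# where batch jobs (MapReduce, Hive, Spark) can process them afterward.
sc = SparkContext(appName="KafkaToHDFS")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["clickstream"],                                # assumed topic name
    kafkaParams={"metadata.broker.list": "broker1:9092"})  # assumed broker

# Each record arrives as a (key, value) pair; keep the value and write it out.
stream.map(lambda kv: kv[1]) \
      .saveAsTextFiles("hdfs:///data/raw/clickstream/batch")

ssc.start()
ssc.awaitTermination()
```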

This complexity is the subject of “Simplifying and Future-Proofing Common Hadoop Use Cases,” a new research report by Wikibon analyst George Gilbert. Gilbert presents two scenarios, one combining Hadoop with a traditional data warehouse and the other integrating streaming and historical data, to illustrate how Hadoop has become an essential tool for enterprises seeking to build Systems of Intelligence, and also how complexity holds back Hadoop’s promise.

The still-immature technology has spawned a legion of complementary tools, each of which must be evaluated for its unique utility. New tools are also emerging all the time, forcing IT organizations to scramble just to keep up with the latest developments. The Hadoop ecosystem desperately needs overarching systems to simplify and manage the complexity beneath.

“Hadoop is one of the most innovative ecosystems the industry has ever seen. But fragmentation and complexity are the trade-offs of all this rapid evolution while the platform is still maturing,” Gilbert writes. “Choice has a cost.”

Spark conundrum

A prominent example of this dynamic is Apache Spark, which roared into the mainstream this year as a promising alternative to Hadoop’s older MapReduce processing engine while drawing major endorsements from IBM and, just yesterday, Cloudera, Inc. Users who had invested in MapReduce must now evaluate not only Spark, but a host of complementary tools like Impala, Drill, Presto, Flink and Samza.

“The downside of all the disruptive innovation is that a great many choices introduce fragmentation by solving a narrow part of the problem,” Gilbert writes.

The payoff is that Hadoop has brought a valuable new dimension to big data analytics, not only reducing cost but also introducing whole new ways to look at data. Gilbert notes that data warehouse capacity can cost as much as $35,000/TB for hardware and software, with as much as 40% of the workload devoted to basic extract/transform/load (ETL) processing. In contrast, Hadoop clusters cost as little as $1,500/TB and can run on low-cost commodity platforms.
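A rough back-of-the-envelope check on those figures is sketched below; it multiplies the quoted per-terabyte costs over an assumed 100 TB of capacity. The capacity figure and the choice to apply the 40% ETL share to total cost are illustrative assumptions, not numbers from the report.

```python
# Back-of-the-envelope comparison using the per-terabyte figures quoted above.
# The 100 TB capacity is an assumed example size, not a figure from the report.
capacity_tb = 100
dw_cost_per_tb = 35_000      # data warehouse hardware and software, per TB
hadoop_cost_per_tb = 1_500   # Hadoop cluster on commodity hardware, per TB
etl_share = 0.40             # share of warehouse workload spent on basic ETL

dw_total = capacity_tb * dw_cost_per_tb          # $3,500,000
hadoop_total = capacity_tb * hadoop_cost_per_tb  # $150,000

# Rough value of the warehouse capacity tied up in ETL if it is not offloaded
etl_cost_on_dw = dw_total * etl_share            # $1,400,000

print(f"Warehouse: ${dw_total:,}   Hadoop: ${hadoop_total:,}")
print(f"Warehouse capacity consumed by ETL: ${etl_cost_on_dw:,.0f}")
```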

Nevertheless, Gilbert doesn’t see Hadoop displacing traditional data warehouses anytime soon, primarily because the technologies are designed for different purposes.

Tale of two data stores

Data warehouses are typically highly structured and designed to answer questions that are already known. Having benefited from decades of development, they are relatively reliable and easy to use, once the extensive data cleansing and formatting work has been done.

Hadoop clusters, in contrast, are optimized for exploratory analytics, in which questions are not well-known in advance and the data itself may uncover new questions to answer. This process can yield insights that bring Hadoop and data warehouses together. “For all the activities that start with exploratory data, a Hadoop cluster is the natural starting point,” he writes. “A data warehouse is the opposite.”

Hadoop is also well suited to real-time processing. This characteristic, when combined with Hadoop’s exploratory strength, makes it the best option for transitioning to Systems of Intelligence, in which the role of IT moves from recording transactions to high-value tasks like predicting markets and behaviors. “Hadoop can augment existing operational Systems of Record and transform them into Systems of Intelligence,” Gilbert writes.

The analyst provides two scenarios that show both the value and complexity of Hadoop. In the first, a company uses Hadoop to dig through raw data from operational systems so business analysts can identify trends to explore in more detail in the corporate data warehouse.

In the second scenario, a telecommunications company combines historical data with current customer relationship management records and real-time call information streamed from cell towers to determine which customers should receive the most generous offers to compensate for poor call quality. This kind of operational analytics was impossible to perform cost-effectively before Hadoop and its ecosystem developed.
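A minimal sketch of how that second scenario might look in code, again using the Spark 1.x streaming API: a static set of CRM records is joined against a live feed of call-quality events so that the highest-value customers experiencing dropped calls surface first. The field layout, host names, and file paths are hypothetical; the report does not prescribe an implementation.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="CallQualityOffers")
ssc = StreamingContext(sc, batchDuration=60)  # evaluate once per minute

# CRM records already landed in HDFS as "customer_id,lifetime_value" lines
crm = sc.textFile("hdfs:///data/crm/customers") \
        .map(lambda line: line.split(",")) \
        .map(lambda f: (f[0], float(f[1])))

# Live call records from the towers as "customer_id,dropped_call_flag" lines
calls = ssc.socketTextStream("tower-feed-host", 9999) \
           .map(lambda line: line.split(",")) \
           .map(lambda f: (f[0], int(f[1])))

# Customers who dropped a call in this window, paired with lifetime value;
# the highest-value customers become candidates for the most generous offers.
affected = calls.filter(lambda kv: kv[1] == 1) \
                .transform(lambda rdd: rdd.join(crm)
                                          .sortBy(lambda kv: -kv[1][1]))

affected.pprint()
ssc.start()
ssc.awaitTermination()
```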

Both scenarios illustrate the unique value that Hadoop provides, but both also show the overwhelming number of choices that users must make. Fortunately, Gilbert expects that time will resolve many of these complexity issues as the rapidly expanding Hadoop ecosystem turns its attention from functionality to simplicity and independent software vendors rush to exploit an enormous opportunity.

