UPDATED 12:21 EDT / APRIL 21 2016

Spark muscling in on Hadoop’s territory, says Wikibon analyst

Is Apache Spark the successor to Hadoop?

Some people think so. Given the batch-oriented Hadoop’s complexity and the notorious performance problems of the MapReduce processing framework and the components that depend upon it, the integration and speed that the in-memory Spark analytics engine brings to the table have a lot of appeal.

The big news in big data for the last 12 months has all been about real-time and streaming analytics. Spark, which became a top-level Apache project in 2014, has gained endorsements from every major big data player. A new player that’s been stirring up excitement is Apache Flink, a true stream processing engine whose developers recently landed $6 million in financing for their Flink-focused startup, Data Artisans.

In a recent interview with SiliconANGLE, Data Artisans CEO Kostas Tzoumas flatly declared that streaming analytics will make batch analytics all but obsolete in a few years.

But Wikibon analyst George Gilbert doesn’t see wholesale change in the offing. Business processes evolve a lot more slowly than technology, he noted. “Take the apps that have dominated computing for 50 or 60 years like payroll. Much of it is batch,” Gilbert said. “What isn’t batch is request/response, which is another way of describing end-users interacting with an application using a screen. For streaming to blow that away completely would take a long time.” Internet-centric companies are adopting streaming analytics faster than the rest of the market, he added.

Bullish

That doesn’t mean Gilbert isn’t bullish on the potential of Spark in particular and streaming analytics in general. Quite the contrary. In a recent research note, he predicted that Spark will be the principal driver of the big data market’s growth over the next decade.

“By 2026, 59 percent of all big data spending will be tied to Spark or related streaming analytics as enterprises seek to deploy applications that can make decisions on behalf of individuals,” he wrote.

The reasons relate both to speed and simplicity. Hadoop introduced important concepts like moving analytics engines close to the data and incorporating unstructured data into a data lake, but it has developed incrementally into an ecosystem of more than 30 discrete components that can be daunting to coordinate.

“We thought we were at this big data nirvana, but it turns out this stuff is ferociously complex,” he said. “Rather than a mix and match approach that’s become like a zoo to manage, Spark wrapped a lot of the components up in one easy-to-administer and easy-to-program package.”

Gilbert foresees Spark taking the place of many of the analytical tools that enterprises are now using. The reasons have as much to do with simplicity as performance. “It’s not just an integrated execution engine for developers. The Spark cluster is all the administrator has to deal with,” he said.

Storage revolution

Hadoop is still needed. One fundamental reason is that Spark lacks a storage component, so technologies like HBase and Hadoop’s HDFS file system fill that gap. An interesting new player in the storage equation is Apache Kafka, a high-speed message queuing system that actually plays the role of a file system for stream processors.

“Kafka keeps track of everything as long as you want, so a streaming engine can dip in and get what it needs,” Gilbert said. “It’s like a file system for stream processing.”
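The "file system for stream processing" idea can be illustrated with a toy append-only log that retains every record and lets a reader start from any position. This is only a sketch under stated assumptions: the `RetainedLog` class and its `append` and `read_from` methods are hypothetical names invented for illustration and are not Kafka's API.

```python
from typing import List, Tuple


class RetainedLog:
    """Toy append-only log (hypothetical, not Kafka's API).

    Records are never deleted, so any reader can start at any offset.
    """

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int, max_records: int) -> List[Tuple[int, str]]:
        """Return up to max_records (offset, record) pairs starting at offset."""
        end = min(offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = RetainedLog()
for reading in ["a", "b", "c"]:
    log.append(reading)

# One reader takes the first two records.
first_read = log.read_from(0, 2)
# Another reader can later "dip in" at offset 1, because nothing was discarded.
replay = log.read_from(1, 10)
```

Because the log keeps its records after they are read, two readers can consume the same data independently and at different times, which is the property the quote describes.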

Flink is another factor. It’s a true real-time analytics engine that is better suited than Spark for applications that demand low-latency continuous processing, such as monitoring sensor data.

Spark has added real-time-like features through the Spark Streaming project, but it’s still fundamentally a “micro-batch” architecture for now, meaning that it simulates real-time analytics by processing small volumes of data quickly in batch mode. For most applications Spark is good enough, but true stream processing will demand a combination of Flink and Kafka unless Spark is able to evolve beyond its micro-batch approach to add per-event streaming.
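The difference between micro-batch and per-event processing can be shown with a small, framework-free sketch. It assumes a running total as the computation and a fixed number of events per batch; real micro-batch engines typically cut batches by time interval, and the function names here are hypothetical rather than anything from Spark or Flink.

```python
from typing import Iterable, Iterator, List


def per_event(events: Iterable[int]) -> Iterator[int]:
    """Emit an updated running total after every single event."""
    total = 0
    for value in events:
        total += value
        yield total


def micro_batch(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Buffer events into small batches and emit one total per batch."""
    total = 0
    batch: List[int] = []
    for value in events:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)  # process the whole buffered batch at once
            batch = []
            yield total
    if batch:  # flush a final, partially filled batch
        total += sum(batch)
        yield total


readings = [3, 1, 4, 1, 5]
per_event_results = list(per_event(readings))
micro_batch_results = list(micro_batch(readings, batch_size=2))
```

Both versions end at the same total, but the per-event version produces a result for each reading, while the micro-batch version produces results only when a batch closes. That gap is the added latency that matters for workloads such as sensor monitoring.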

Gilbert sees great promise in in-memory analytic engines because “we are on the cusp of a major transformation in servers where much of storage will actually become a form of memory” in the form of flash and other new technologies. “That means it will be easy to do more things with Flink and Spark that you would have previously done in batch mode.” Kafka could become the ingest layer and Hadoop could house the data lake.

These new, more interactive frameworks will be used differently than big data tools were in the past. Gilbert calls them “Systems of Intelligence.”

“Data warehousing applications gave you a definitive answer to the question you asked,” he said. “In the future you’ll get a best guess or most likely recommendation for what should happen next.” But just how this more predictive approach to analytics will play out in the form of a new generation of packaged applications is still a work in progress. “The processes aren’t yet standardized,” Gilbert said.

