Spark muscling in on Hadoop’s territory, says Wikibon analyst
Is Apache Spark the successor to Hadoop?
Some people think so. Given the batch-oriented Hadoop’s complexity and the notorious performance problems of the MapReduce processing framework and the components that depend upon it, the integration and speed that the in-memory Spark analytics engine brings to the table have a lot of appeal.
The big news in big data for the last 12 months has all been about real-time and streaming analytics. Spark, which became a top-level Apache project in 2014, has gained endorsements from every major big data player. A new player that’s been stirring up excitement is Apache Flink, a true stream processing engine whose developers recently landed $6 million in financing for their Flink-focused startup, Data Artisans.
In a recent interview with SiliconANGLE, Data Artisans CEO Kostas Tzoumas flatly declared that streaming analytics will make batch analytics all but obsolete in a few years.
But Wikibon analyst George Gilbert doesn’t see wholesale change in the offing. Business processes evolve a lot more slowly than technology, he noted. “Take the apps that have dominated computing for 50 or 60 years like payroll. Much of it is batch,” Gilbert said. “What isn’t batch is request/response, which is another way of describing end-users interacting with an application using a screen. For streaming to blow that away completely would take a long time.” Internet-centric companies are adopting streaming analytics faster than the rest of the market, he added.
Bullish
That doesn’t mean Gilbert isn’t bullish on the potential of Spark in particular and streaming analytics in general. Quite the contrary. In a recent research note, he predicted that Spark will be the principal driver of the big data market’s growth over the next decade.
“By 2026, 59 percent of all big data spending will be tied to Spark or related streaming analytics as enterprises seek to deploy applications that can make decisions on behalf of individuals,” he wrote.
The reasons relate both to speed and simplicity. Hadoop introduced important concepts like moving analytics engines close to the data and incorporating unstructured data into a data lake, but it has developed incrementally into an ecosystem of more than 30 discrete components that can be daunting to coordinate.
“We thought we were at this big data nirvana, but it turns out this stuff is ferociously complex,” he said. “Rather than a mix and match approach that’s become like a zoo to manage, Spark wrapped a lot of the components up in one easy-to-administer and easy-to-program package.”
Gilbert foresees Spark taking the place of many of the analytical tools that enterprises are now using. The reasons have as much to do with simplicity as performance. “It’s not just an integrated execution engine for developers. The Spark cluster is all the administrator has to deal with,” he said.
Storage revolution
Hadoop is still needed. One fundamental reason is that Spark lacks a storage component, so technologies like HBase and Hadoop’s HDFS file system fill that gap. An interesting new player in the storage equation is Apache Kafka, a high-speed message queuing system that actually plays the role of a file system for stream processors.
“Kafka keeps track of everything as long as you want, so a streaming engine can dip in and get what it needs,” Gilbert said. “It’s like a file system for stream processing.”
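The “file system for stream processing” idea can be illustrated with a toy append-only log. This is a minimal sketch and not Kafka’s API: the `RetainedLog` class, its methods, and the sample sensor readings are hypothetical names invented here. It shows only the property Gilbert describes, that records stay in the log after being read, so each consumer can start from whatever offset it needs.

```python
class RetainedLog:
    """Toy append-only log (hypothetical, not Kafka's API).

    Records are kept after they are read, so any number of
    consumers can read from any offset at any time.
    """

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return its offset (position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset, without removing them."""
        return self._records[offset:offset + max_records]


log = RetainedLog()
for reading in ["t=1 temp=20", "t=2 temp=21", "t=3 temp=23"]:
    log.append(reading)

# A "live" consumer picks up only the newest record...
live = log.read(offset=2)

# ...while a second consumer replays the full history from the start.
replay = log.read(offset=0)
```

Because reading does not delete anything, the replaying consumer still sees all three readings after the live consumer has taken the latest one. A conventional message queue that discards a message once it is delivered would not allow that.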
Flink is another factor. It’s a true real-time analytics engine that is better suited than Spark for applications that demand low-latency continuous processing, such as monitoring sensor data.
Spark has added real-time-like features through the Spark Streaming project, but it’s still fundamentally a “micro-batch” architecture for now, meaning that it simulates real-time analytics by processing small volumes of data quickly in batch mode. For most applications Spark is good enough, but true stream processing will demand a combination of Flink and Kafka unless Spark is able to evolve beyond its micro-batch approach to add per-event streaming.
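The difference between micro-batch and per-event processing can be sketched in a few lines of Python. This is an idealized model, not Spark or Flink code: the function names, the one-second batch interval, and the sample arrival times are assumptions chosen for illustration, and processing time itself is ignored. The sketch only shows how long each event waits before the engine starts working on it.

```python
from collections import defaultdict


def micro_batches(arrivals_ms, interval_ms):
    """Group arrival times into fixed windows, keyed by the time each window closes."""
    batches = defaultdict(list)
    for t in arrivals_ms:
        window_end = (t // interval_ms + 1) * interval_ms
        batches[window_end].append(t)
    return dict(batches)


def micro_batch_waits(arrivals_ms, interval_ms):
    """Micro-batch model: an event waits until its window closes."""
    return [(t // interval_ms + 1) * interval_ms - t for t in arrivals_ms]


def per_event_waits(arrivals_ms):
    """Per-event model: an event is handed to the engine the moment it arrives."""
    return [0 for _ in arrivals_ms]


# Hypothetical sensor events arriving at these times (milliseconds).
arrivals = [100, 450, 900, 1200]

batches = micro_batches(arrivals, interval_ms=1000)
batch_waits = micro_batch_waits(arrivals, interval_ms=1000)
event_waits = per_event_waits(arrivals)
```

In this model the event arriving at 100 ms sits idle for 900 ms before its batch runs, while a per-event engine would begin on it immediately. Shrinking the batch interval narrows the gap, which is why micro-batching is good enough for many applications, but the wait never reaches zero.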
Gilbert sees great promise in in-memory analytic engines because “we are on the cusp of a major transformation in servers where much of storage will actually become a form of memory” in the form of flash and other new technologies. “That means it will be easy to do more things with Flink and Spark that you would have previously done in batch mode.” Kafka could become the ingest layer and Hadoop could house the data lake.
This new, more interactive framework will be used differently than big data tools were in the past. Gilbert calls such systems “Systems of Intelligence.”
“Data warehousing applications gave you a definitive answer to the question you asked,” he said. “In the future you’ll get a best guess or most likely recommendation for what should happen next.” But just how this more predictive approach to analytics will play out in the form of a new generation of packaged applications is still a work in progress. “The processes aren’t yet standardized,” Gilbert said.