Spark muscling in on Hadoop’s territory, says Wikibon analyst
Is Apache Spark the successor to Hadoop?
Some people think so. Given the batch-oriented Hadoop’s complexity and the notorious performance problems of the MapReduce processing framework and the components that depend upon it, the integration and speed that the in-memory Spark analytics engine brings to the table have a lot of appeal.
The big news in big data for the last 12 months has all been about real-time and streaming analytics. Spark, which became a top-level Apache project in 2014, has gained endorsements from every major big data player. A new player that’s been stirring up excitement is Apache Flink, a true stream processing engine whose developers recently landed $6 million in financing for their Flink-focused startup, Data Artisans.
In a recent interview with SiliconANGLE, Data Artisans CEO Kostas Tzoumas flatly declared that streaming analytics will make batch analytics all but obsolete in a few years.
But Wikibon analyst George Gilbert doesn’t see wholesale change in the offing. Business processes evolve a lot more slowly than technology, he noted. “Take the apps that have dominated computing for 50 or 60 years like payroll. Much of it is batch,” Gilbert said. “What isn’t batch is request/response, which is another way of describing end-users interacting with an application using a screen. For streaming to blow that away completely would take a long time.” Internet-centric companies are adopting streaming analytics faster than the rest of the market, he added.
That doesn’t mean Gilbert isn’t bullish on the potential of Spark in particular and streaming analytics in general. Quite the contrary. In a recent research note, he predicted that Spark will be the principal driver of the big data market’s growth over the next decade.
“By 2026, 59 percent of all big data spending will be tied to Spark or related streaming analytics as enterprises seek to deploy applications that can make decisions on behalf of individuals,” he wrote.
The reasons relate both to speed and simplicity. Hadoop introduced important concepts like moving analytics engines close to the data and incorporating unstructured data into a data lake, but it has developed incrementally into an ecosystem of more than 30 discrete components that can be daunting to coordinate.
“We thought we were at this big data nirvana, but it turns out this stuff is ferociously complex,” he said. “Rather than a mix and match approach that’s become like a zoo to manage, Spark wrapped a lot of the components up in one easy-to-administer and easy-to-program package.”
Gilbert foresees Spark taking the place of many of the analytical tools that enterprises are now using. The reasons have as much to do with simplicity as performance. “It’s not just an integrated execution engine for developers. The Spark cluster is all the administrator has to deal with,” he said.
Hadoop is still needed. One fundamental reason is that Spark lacks a storage component, so technologies like HBase and Hadoop’s HDFS file system fill that gap. An interesting new player in the storage equation is Apache Kafka, a high-speed message queuing system that actually plays the role of a file system for stream processors.
“Kafka keeps track of everything as long as you want, so a streaming engine can dip in and get what it needs,” Gilbert said. “It’s like a file system for stream processing.”
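The "file system for stream processing" idea can be illustrated with a minimal sketch of an append-only log that retains every record and lets a reader start from any offset. This is not Kafka's API; the `RetainedLog` class and its method names are hypothetical and exist only to show the concept.

```python
from typing import Any, List


class RetainedLog:
    """Hypothetical append-only log; illustrates retention plus offset reads."""

    def __init__(self) -> None:
        self._records: List[Any] = []

    def append(self, record: Any) -> int:
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records starting at offset; nothing is deleted."""
        return self._records[offset:offset + max_records]


log = RetainedLog()
for reading in ["temp=20", "temp=21", "temp=25"]:
    log.append(reading)

# A consumer that joins late can still replay from the beginning,
# while another picks up from where it left off.
replay_all = log.read(0)
resume_from_two = log.read(2)
```

Because reads never remove records, several stream processors can consume the same data at their own pace, which is the property Gilbert is describing.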
Flink is another factor. It’s a true real-time analytics engine that is better suited than Spark for applications that demand low-latency continuous processing, such as monitoring sensor data.
Spark has added real-time-like features through the Spark Streaming project, but it’s still fundamentally a “micro-batch” architecture for now, meaning that it simulates real-time analytics by processing small volumes of data quickly in batch mode. For most applications Spark is good enough, but true stream processing will demand a combination of Flink and Kafka unless Spark is able to evolve beyond its micro-batch approach to add per-event streaming.
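The difference between micro-batch and per-event processing can be sketched in a few lines of plain Python. This is an illustration only, using neither Spark's nor Flink's API; the function names, the one-second interval and the sample sensor readings are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, int]  # (arrival time in seconds, sensor reading)


def micro_batch_sums(events: Iterable[Event], interval: float) -> List[Tuple[float, int]]:
    """Group events into fixed time windows and emit one sum per window.

    Each result is stamped with the time its window closes, so an event
    waits until the end of its window before it affects any output.
    """
    batches: Dict[int, int] = {}
    for arrival, value in events:
        window = int(arrival // interval)
        batches[window] = batches.get(window, 0) + value
    return [((w + 1) * interval, total) for w, total in sorted(batches.items())]


def per_event_sums(events: Iterable[Event]) -> List[Tuple[float, int]]:
    """Emit an updated running total as soon as each event arrives."""
    total = 0
    out: List[Tuple[float, int]] = []
    for arrival, value in events:
        total += value
        out.append((arrival, total))
    return out


EVENTS: List[Event] = [(0.1, 5), (0.4, 7), (1.2, 3), (2.7, 1)]

batched = micro_batch_sums(EVENTS, interval=1.0)
streamed = per_event_sums(EVENTS)
```

In the micro-batch version the reading that arrives at 0.1 seconds is not reflected in any output until its window closes at 1.0 seconds, whereas the per-event version reacts at 0.1 seconds. That waiting time is the latency gap that matters for workloads such as sensor monitoring.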
Gilbert sees great promise in in-memory analytic engines because “we are on the cusp of a major transformation in servers where much of storage will actually become a form of memory,” in the form of flash and other new technologies. “That means it will be easy to do more things with Flink and Spark that you would have previously done in batch mode.” Kafka could become the ingest layer and Hadoop could house the data lake.
This new, more interactive framework will be used differently than big data tools were in the past. Gilbert calls them “Systems of Intelligence.”
“Data warehousing applications gave you a definitive answer to the question you asked,” he said. “In the future you’ll get a best guess or most likely recommendation for what should happen next.” But just how this more predictive approach to analytics will play out in the form of a new generation of packaged applications is still a work in progress. “The processes aren’t yet standardized,” Gilbert said.