Study: Spark is outgrowing, and increasingly displacing, Hadoop
A rift is forming in the open-source ecosystem that may drastically alter the trajectory of modern analytics. Apache Spark, the speedy in-memory data crunching engine developed to take Hadoop beyond batch processing, is increasingly drifting away from the project as new use cases drive early adopters to reconsider their implementation choices.
That’s the key conclusion of the first official annual user survey from Databricks Inc., the startup co-founded by Spark creator Matei Zaharia and several peers from UC Berkeley to commercialize the framework. Some 48 percent of the 1,417 data scientists and other participants in the poll said that their organizations have deployed the engine as a standalone cluster.
That compares to the 40 percent whose companies are running Spark on Hadoop, which is not particularly encouraging for Cloudera Inc. and the other distributors that have spent the last few years trying to monetize the latter project. Compounding the threat is the growth that Databricks has recorded in the uptake of the in-memory engine’s value-added extensions.
That includes first and foremost Spark SQL, the structured query component, which the study found to have seen adoption nearly quadruple over the past year from four percent of the overall Spark user base to almost a quarter. The technology substitutes for the functionality of Cloudera’s Impala and many other alternatives in the Hadoop ecosystem.
Trailing the query layer in second place is Spark Streaming, which jumped a more modest 56 percent and is now in use among some 14 percent of the entire user base. That growth will likely expand much further as more and more organizations find themselves needing to process data in real time due to the proliferation of connected devices in the corporate network.
For the time being, however, the main reason why CIOs are refocusing their analytics efforts from Hadoop to Spark is its raw speed. An overwhelming 91 percent of the respondents to the survey cited performance as a key advantage of the engine, an edge that will only increase as Databricks continues to optimize the underlying architecture.
Not all the credit goes to the startup, however. Much of the work is done by the surrounding ecosystem of outside contributors, which saw its ranks swell by 600 members in the last 12 months, more than twice as many as the previous year, according to the study. Among Spark’s newest backers is IBM Corp., which recently committed a billion dollars and 3,500 engineers to accelerating its development.
That makes it abundantly clear where Big Blue thinks the open-source analytics movement is headed, a sentiment that is shared by even the staunchest Hadoop supporters. Cloudera has launched an initiative to make Spark the new default processing engine of the platform in an effort to capitalize on its popularity, citing many of the same reasons as the respondents to Databricks’ survey.
Photo via sethink