Can Apache Spark live up to the hype?

Spark Summit 2014 provided the latest evidence that the Apache Spark open source data analytics framework may be about to catch fire.

Lauded by SiliconANGLE’s John Furrier as the “next big thing in Big Data”, Spark is turning heads because of its impressive speed. It’s becoming a popular alternative to MapReduce, which is one of the central components in Hadoop. “Spark is a fast data analysis engine,” said Furrier. “Think Hadoop MapReduce, but 100 times faster and still fully interoperable with the wider Hadoop ecosystem.”

Sparked into life


The biggest news out of Spark Summit week came from Databricks, which is one of the most prominent companies trying to commercialize the framework. Databricks said it raised $33 million in Series B funding, and also rolled out a cloud computing service for easily creating, deploying and running Spark workloads.

Speaking at Spark Summit, Databricks Co-founder and CEO Ion Stoica said Databricks Cloud is all about simplifying the data pipeline process. It does this by combining myriad data stores and data-processing systems (for batch processing, stream processing, interactive SQL and graph processing) into a single management platform under one API.

“Clusters are difficult to set up and manage, and extracting value from your data requires you to integrate a hodgepodge of disparate tools, which are themselves hard to use,” Stoica said. Databricks’ value is to combine “the power of Spark with a zero-management hosted platform and an initial set of applications built around common workflows.”


DataStax has also jumped on the bandwagon, announcing the Enterprise 4.5 edition of its Cassandra NoSQL engine, with support for Spark. Stratio, a company that claims to be the “pure Spark Big Data platform”, launched Stratio Enterprise, which offers batch processing, data streaming and real-time analysis combined with NoSQL database integration and HDFS. Guavus, a Big Data analytics firm, announced version 2.0 of its Reflex operational intelligence program with support for both Apache Spark and Hadoop YARN.

A bright future ahead?



Spark is best known for its in-memory machine learning capabilities, but it also supports streaming and SQL analysis, and developers are currently working on a way to bring graph analysis and R analytics into its framework. Above all, though, Spark is known for its speed.

The stable release of Spark, v1.0.0, only came out in May, but Spark has actually been around for some time. It was first developed in 2009 at UC Berkeley’s AMPLab, before going open-source in 2010, and then becoming an Apache Top-Level Project in February 2014. In that time an impressive crowd of disciples has gathered around it, including organizations like Yahoo!, Intel, Adobe, Cloudera, UC Berkeley, and Databricks. The popularity of NoSQL databases is helping it along.

“NoSQL databases are popular because they allow developers to incorporate data of any structure and don’t bind them to a particular data model,” said Wikibon principal researcher Jeff Kelly. However, he cautioned that “NoSQL databases, in general, are less mature when it comes to analytic capabilities.”

Spark’s appeal isn’t lost on Hadoop providers, though. All of the major vendors have jumped into bed with Databricks to integrate it with their distros. Cloudera was one of the first, announcing a partnership with Databricks back in February. It now offers Spark software and supports production-ready deployments. IBM, MapR, Pivotal and most recently Hortonworks have all since signed up.

Spark still needs a high-scale storage layer to operate, and since it’s strictly for data analysis, it will never replace Hadoop outright. But with Databricks’ cloud platform, it’s possible Spark could replace many of Hadoop’s components, most notably MapReduce. The sheer number of Spark users lends credence to this view – while stream processing and machine learning are the obvious use cases for Spark, its SQL, graph analysis and R-based data mining capabilities are just as powerful.

Not everyone is convinced of Spark’s greatness just yet. In an interview with InformationWeek, MapR CMO Jack Norris acknowledged Spark’s momentum but warned we’re still in the early days. “Yes, it can do a range of processing, but there are many issues in the framework that limit the use cases,” said Norris. “One example is that it is dependent on available memory; any large dataset that exceeds it will hit a huge performance wall.”

Spark might be immature, but the ambitions of those behind it are clear. What’s also clear is that Spark has a great deal going for it – wide-ranging support, including for Amazon Web Services’ S3 and the Cassandra NoSQL database, a huge developer community, and most of all, the promise of a single, cohesive alternative to the mishmash of data analysis tools that are currently used with Hadoop.
