UPDATED 10:55 EDT / JUNE 06 2016


#SparkSummit West 2016 preview: The power of 2.0

Spark Summit West kicks off in San Francisco today, and you can bet the buzz will be all about Spark 2.0.

Databricks Inc., the principal curator of Spark, was impeccable in its timing, announcing the successor to the current version 1.6 just three weeks ago after teasing it for three months. Databricks may be looking to rekindle some buzz after the barnburner year Spark had in 2015, when it gained endorsements from every major Hadoop vendor and a commitment by IBM to train 3,500 of its people to use the analytics platform.

Things have been quieter since then, but it would be difficult to maintain the 2015 pace. That’s why version 2.0 is hitting at just the right time, as thousands of Big Data-holics gather by the Bay. SiliconANGLE’s theCUBE will be there, live-streaming interviews and presentations Tuesday and Wednesday. Get the lowdown here.

“Spark is on a roll in the mainstream of Big Data,” said Wikibon Analyst George Gilbert. “It’s taking over more and more workloads of all kinds in the Hadoop ecosystem, although it’s not replacing Hadoop.”

Gilbert is bullish on Spark. In a recent report he forecast that the framework will account for more than one third of all Big Data spending by 2022. Driving its momentum will be increased interest in “continuous, real-time processing of vast streams of data,” of which Spark will be a “crucial catalyst,” Gilbert wrote. Even with new alternatives in the market, there is still plenty of room for everyone.

Version 2.0 is likely to be a crowd-pleaser. It builds upon Spark’s ease of use with improved SQL support and aims to delight developers with a unified DataFrame/Dataset API, which makes it easier for programmers to apply Spark to other applications. Databricks has more detail on its blog.

Alternatives emerge

The stream-processing revolution that Spark kicked off has also given rise to some rival services, each with its own strengths and nuances. In April, the Apex open-source stream and batch processing platform was granted top-level status by the Apache Software Foundation. Apex runs in memory, is compatible with the Hadoop Distributed File System (HDFS) and YARN, and offers enhanced event processing and fault tolerance capabilities. Whereas Spark’s approach to streaming is rooted in its batch origins, Apex is a true real-time analytics engine. In March, the creators of Apache Flink raised $6 million to build a commercial version of that real-time in-memory processing engine.

A rival platform from Google, called Cloud Dataflow, also caught some buzz after outperforming Spark in a recent benchmark. The platform’s API was recently accepted as an incubator project by the Apache Software Foundation under the name Apache Beam. While that’s no guarantee of mainstream acceptance, it does indicate that the stream processing market is likely to get more crowded – and confusing.

Meanwhile, Spark isn’t standing still. In addition to announcing version 2.0, Databricks has said it’s working to integrate Apache Arrow into Spark. Arrow, which itself reached top-level status in February, boasts a 10- to 100-fold speed boost by applying columnar in-memory analytics, which processes data sets column by column rather than using the slower row-by-row approach.
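For readers wondering what “columnar” means in practice, here is a toy sketch of the idea. It uses plain Python lists, not Arrow’s actual API, and the sample records and function names are invented for illustration. It only shows why an aggregate over one field touches less data when each field is stored together.

```python
# Toy illustration only: plain Python structures, not Apache Arrow's API.
# The records and function names below are hypothetical.

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user": "a", "clicks": 3, "spend": 1.5},
    {"user": "b", "clicks": 7, "spend": 4.0},
    {"user": "c", "clicks": 2, "spend": 0.5},
]


def sum_clicks_rows(records):
    """Sum one field by visiting every record, including fields we don't need."""
    return sum(record["clicks"] for record in records)


def to_columns(records):
    """Pivot a list of records into one list per field (a columnar layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def sum_clicks_columns(columns):
    """Sum one field by reading only that field's list."""
    return sum(columns["clicks"])


columns = to_columns(rows)
row_total = sum_clicks_rows(rows)
column_total = sum_clicks_columns(columns)
```

Both functions return the same total; the difference is that the columnar version reads only the values it needs. Real columnar engines build on that by packing each column into contiguous memory, which is not modeled here.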

Databricks is also working on streaming applications that incorporate online machine learning, though it’s not yet clear when that technology could arrive, Gilbert said. “Spark is so early in its lifecycle and customer adoption that it has plenty of runway to continue its ascent,” he said. “There’s lots of unfinished work in the pipeline that should keep it relevant for a long time.”

Enthusiasts aplenty

Boosters continue to be enthusiastic. IBM has about 15 shipping products that leverage the analytics framework and over a dozen more in the works, said Anjul Bhambhri, VP of engineering for big data and analytics at IBM, at Spark Summit East in February. Big Blue today will make “important announcements for ensuring that R, Spark and open data science continue to drive innovative business applications,” wrote Big Data Evangelist James Kobielus on the Big Data & Analytics Hub. IBM is also hosting a Spark Maker Community Event this evening in San Francisco.

MapR Technologies Inc. kicked off the conference this morning by announcing a new Spark distribution that includes the complete Spark stack along with a collection of technologies from its own platform that it said enable Spark to support all major types of advanced analytics, including batch processing, machine learning, procedural SQL and graph computation. You can no doubt expect other news out of the conference, although vendors are keeping their plans close to the vest for now.

Speakers from Amazon Web Services LLC, Nasdaq Inc., Netflix Inc., Intel Corp. and Capital One Financial Corp. will talk about how they’re using Spark in applications ranging from recommendation engines to credit card fraud prevention. Doug Cutting, the principal architect of Hadoop, makes a brief appearance on Wednesday to discuss how Spark represents an evolution from the MapReduce distributed processing framework.

Keep your ears open for tomorrow’s presentation by Google Senior Fellow Jeff Dean, who is scheduled to discuss “Large-Scale Deep Learning with TensorFlow.” Wikibon’s Gilbert said he was “astonished” to see Dean on the agenda. “He is responsible for just about all of Google’s most high-profile database and machine learning products,” Gilbert said. Which could mean those machine learning breakthroughs are closer than we think.

Photo via Spark Summit on Facebook
