UPDATED 23:33 EST / JULY 27 2016

Databricks ships out “easier, faster, smarter” Apache Spark 2.0

The immensely popular open-source cluster computing framework Apache Spark has just reached version 2.0, according to an announcement by the Apache Software Foundation (ASF) yesterday.

Spark’s popularity has made it one of the most active open-source Big Data projects, approaching the level of Apache Hadoop, one of the oldest and most established Big Data technologies. Much of Spark’s acclaim stems from the functionality it offers beyond MapReduce, the original Hadoop processing component that it is now rapidly replacing. Spark supports features not found in MapReduce, such as real-time analytics of streaming data, in-memory processing, machine learning and interactive queries.

Now, with Spark 2.0, that functionality has improved even further.

“Apache Spark 2.0.0 is the first release on the 2.x line,” noted the ASF on the Apache Spark website. “The major updates are API usability, SQL 2003 support, performance improvements, structured streaming, R UDF support, as well as operational improvements. In addition, this release includes over 2,500 patches from over 300 contributors.”

Databricks Inc., the company founded by Spark’s creators to commercialize the platform, framed the improvements in terms of the platform’s “three core attributes”: easier, faster, smarter. It made the announcement in a blog post that said Databricks is the first commercial vendor to support Apache Spark 2.0.

In a separate blog post, Databricks explained some of the most notable new features in the release, which focus on two specific areas: standard SQL support and a unified DataFrame/Dataset API.

First up, the new release streamlines Spark’s APIs by unifying the DataFrame and Dataset APIs in Java and Scala. In Spark 2.0, DataFrame is now simply a type alias for Dataset of Row. In addition, the release comes with expanded SQL support, including a new ANSI SQL parser and support for subqueries, which are queries nested inside another query.
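To make the term concrete, here is a minimal sketch of a subquery. It uses Python’s built-in sqlite3 module as a stand-in for Spark SQL so that it runs without a cluster. The orders table and its rows are hypothetical and exist only for this illustration, and nothing here is taken from Databricks’ own examples.

```python
import sqlite3

# Hypothetical table and data, invented for this illustration.
# sqlite3 is a stand-in here; this is not Spark SQL itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 40.0), ("carol", 95.0), ("dave", 15.0)],
)

# The inner SELECT (the subquery) computes the average order amount;
# the outer query keeps only the rows whose amount exceeds that average.
query = """
    SELECT customer, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY customer
"""
above_average = conn.execute(query).fetchall()
print(above_average)
conn.close()
```

The inner query runs as part of the outer one, so the average does not need to be computed and stored in a separate step.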

The other main focus in Spark 2.0 was speed. Databricks points to its 2015 Spark Survey, which showed that 91 percent of users rated performance as one of the most important aspects of the software. Responding to this feedback, Databricks took a long, hard look at Spark’s physical execution layer, then redesigned it and introduced a second-generation Tungsten engine. The new engine “builds upon ideas from modern compilers and MPP databases and applies them to Spark workloads,” the company said.

Spark 2.0 also comes with a brand new API called Structured Streaming that’s designed to allow applications to make decisions in real time. Structured Streaming offers three main improvements: an API integrated with batch jobs, transactional interaction with storage systems, and rich integration with Spark’s other components. Spark 2.0 ships with the initial alpha release of Structured Streaming as an extension of the DataFrame and Dataset APIs.

Databricks reckons that with the new improvements, developers will no longer need to keep their apps in sync with batch jobs or manage failures manually, as the streaming job will now always give the same answer as a batch job on the same dataset. In addition, developers can now build complete applications rather than just streaming pipelines.
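The batch-equivalence claim can be illustrated with a toy sketch in plain Python. This is not Spark’s Structured Streaming API. The function names, the event list and the way it is split into micro-batches are all assumptions made for illustration. The idea is only that an incrementally maintained aggregate should end up equal to the one computed over the full dataset in one pass.

```python
from collections import Counter
from typing import Iterable, List


def batch_counts(events: List[str]) -> Counter:
    """Count every event in a single pass over the complete dataset."""
    return Counter(events)


def streaming_counts(micro_batches: Iterable[List[str]]) -> Counter:
    """Maintain a running count, updated as each micro-batch arrives."""
    running: Counter = Counter()
    for batch in micro_batches:
        running.update(batch)
    return running


# Hypothetical event log, invented for this illustration.
events = ["click", "view", "click", "purchase", "view", "click"]

# The same events, arriving in three chunks instead of all at once.
micro_batches = [events[0:2], events[2:5], events[5:]]

batch_result = batch_counts(events)
stream_result = streaming_counts(micro_batches)

# True when the incremental result matches the one-shot batch result.
print(batch_result == stream_result)
```

In a real engine, the hard part is preserving this equality across failures and restarts, which is the bookkeeping Databricks says developers no longer have to manage by hand.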

“One of the things that’s really exciting for me as a developer of Apache Spark is seeing how quickly users start to use new features and APIs we introduce, and in turn, offer almost instantaneous feedback, so that we can continue to improve them,” said Matei Zaharia, CTO and co-founder of Databricks and creator of Apache Spark, in a press release.

On its Spark site, the ASF took pains to point out some essential resources for developers wishing to learn more about Spark, including Scala resources such as “First Steps to Scala,” “Scala tutorial for Java programmers” and “Programming in Scala.” There’s also a general “Spark Programming Guide” with code examples in Scala, Java and Python.

Image credit: Mikegi via pixabay
