UPDATED 12:00 EDT / JUNE 06 2017

Databricks updates Apache Spark’s deep learning, streaming capabilities

Databricks Inc. today took some serious steps toward boosting the value proposition of the popular open-source Apache Spark big data processing engine, which is facing potent new competition.

The San Francisco-based company announced updates that include new deep learning and structured streaming capabilities that add more versatility to the platform, as well as a new serverless offering that aims to reduce the complexity and cost of running Spark clusters in the cloud.

“We need much easier-to-use systems,” Matei Zaharia, Databricks’ co-founder and chief technologist, said at a keynote at the company’s Spark Summit conference in San Francisco.

The updates unveiled today come at a time when Apache Spark has established itself as the tech du jour for data scientists running high-performance, real-time and in-memory analytics, James Kobielus, an analyst at Wikibon, owned by the same company as SiliconANGLE, said in an interview. In particular, Spark has rapidly emerged as one of the most popular platforms for in-memory, machine learning, graph and streaming analytics.

“Many solution providers have integrated Spark with their middleware, development tools, data platforms, and applications,” Kobielus said. “Developers increasingly use Spark for high-profile applications, especially those focused on machine learning, big data and real-time streaming analytics.”

That’s not to say Spark is having everything its own way, though. In spite of its rapid adoption, the technology is already starting to feel challenged by the rise of new trends around artificial intelligence and machine learning, Holger Mueller, principal analyst and vice president of Constellation Research Inc., said in an interview. He noted that Spark has reached that interesting phase where the technology has matured and grabbed a substantial share of the market, only to start feeling threatened by the emergence of newer technologies.

“Spark is doing well, but now it needs to develop its uptake and positioning strategy,” Mueller said. “How well it does that with new synergistic offerings will decide whether or not it becomes a lasting standard for next-generation applications beyond the initial three- to five-year hype cycle.”

Spark boosts its deep learning chops

As if on cue, Databricks today introduced its new Deep Learning Pipelines feature for Spark that integrates deep learning features and application programming interfaces with the platform’s core codebase for the first time. The idea is to make it easier to integrate Spark with popular technologies such as the open-source deep learning framework TensorFlow.

Deep Learning Pipelines introduces an application programming interface that allows for easy access to deep learning libraries in TensorFlow and Keras, another open-source deep learning library. With this, data scientists can now combine Spark’s data processing capabilities with some of the most popular deep learning frameworks around and use them to train and scale new models for a wide range of use cases.

Zaharia said today that Databricks aims to make it possible for 10 times more people to create deep learning applications by making the whole process easier.

“Spark is playing a significant, growing and occasionally unsung role in the deep learning revolution,” Kobielus said. “However, deep learning features and APIs have not been integrated with the core open-source Apache Spark code base, so we’re encouraged by Databricks’ announcement.”

Databricks also announced a new Databricks Serverless offering, which the company said is the first fully managed platform for Spark. The solution is aimed at developers in startups primarily working in cloud environments.

“For these developers, many of whom are in startups that lack deep IT resources, it’s absolutely essential to have a fully managed cloud-based development environment in which the service provider automatically handles cluster configuration, optimization, security, and workload management,” Kobielus said.

Finally, Databricks announced the general availability of a high-level API for Structured Streaming, which delivers significantly lower latency and higher throughput than before. This is an important update because Spark is currently facing stiff competition from open-source technologies such as Apache Flink and Apache Beam when it comes to streaming data.

All in all, today’s updates bode well for Apache Spark, Kobielus said. He noted that Spark continues to gain mainstream adoption in diverse application domains, primarily focused on big data, data science, advanced analytics and the Internet of Things. The platform is also establishing itself as a useful but complementary tool in the development and training of deep learning models, and these strengths should ensure Spark remains relevant for some time to come, Kobielus said.

Still, Databricks’ announcements today failed to address its in-memory data processing capabilities, which Mueller said was Spark’s biggest strength but also its biggest weakness. He said there’s a risk that Spark could become less important in future as business-relevant data is growing at a much faster pace than in-memory prices are falling.

The danger, Mueller said, is that the costs of in-memory processing could see Spark’s importance decline to a point where it’s viewed as little more than a “simple cache” that lacks standing in the market with regard to monetization. It’s a problem that will become more pronounced with the rise of data-intensive workloads like deep learning if Spark is unable to keep pace.

“This becomes even more relevant in the age of self-learning and machine learning algorithms, which need big-data-sized storage to trigger constant learning,” Mueller said.

With reporting from Robert Hof

Image: Apache Software Foundation
