UPDATED 01:10 EDT / JUNE 07 2017

Analysis: At Spark Summit, Databricks pushes Apache Spark where it needs to go

Invented eight years ago and intensively commercialized over the past several years, Apache Spark has become a core power tool for data scientists and other developers working on sophisticated projects in machine learning, continuous stream computing and graph analytics. The open-source codebase’s worldwide customer base now includes more than 225,000 users, and it’s expanding rapidly.

However, for all its success in gaining mainstream adoption, Spark has begun to lose its “next big thing” status in the data science community and in big-data analytics generally. Over the past year, that status has shifted to something that has been around for a very long time: artificial intelligence.

More specifically, the next big thing, as judged by how thoroughly it has permeated the popular imagination, is that sophisticated cousin of machine learning known as deep learning. Just as Spark had superseded Hadoop as the buzzy preoccupation of data scientists and big-data analytics professionals, it was beginning to seem as if Spark itself were in danger of being eclipsed in this new era of recurrent, convolutional and other multilayered neural networks that fly the banner of deep learning.

But this week’s announcements show that Spark is far from irrelevant in the era of AI and deep learning. Spark Summit 2017 in San Francisco Tuesday featured several important announcements by Databricks that point the way toward the community’s evolution into an integral component of the burgeoning AI/deep learning ecosystem. Beyond that, Databricks’ announcements show that the company, the primary developer, committer and visionary vendor in the Spark arena, is actively engaged in growing the core codebase’s functionality, performance, scalability, manageability and ease of use and development.

Here are Databricks’ principal announcements and how they push Spark where it needs to go to stay relevant and deliver deeper value in open analytics ecosystems:

Pushing Spark more deeply into deep learning: Spark plays a significant, growing and occasionally unsung role in the deep learning revolution. In environments such as those I discussed in a previous article, deep learning developers use Spark to quickly train and deploy multilayered neural nets with libraries and compute clusters that are already at their disposal. At Spark Summit 2017, Databricks announced Deep Learning Pipelines, which provides an application programming interface for simplified programmatic access to the TensorFlow and Keras libraries from within Spark’s MLlib Pipelines.

One key innovation is Deep Learning Pipelines’ ability to expose functions through SQL to make them available to the broader analytics developer community. Wikibon hopes that Databricks will follow up quickly by adding API access to other leading open-source deep-learning tools, especially Caffe2, Theano and MXNet. And we urge Databricks, as one of the prime Spark committers, to rapidly contribute this deep-learning integration codebase to the Apache Spark community.

Pushing Spark more thoroughly into DevOps: Spark is the foundation of many innovative application development initiatives involving machine learning, continuous computation and graph analysis. More Spark applications are being developed for mission-critical enterprise use, which requires that data scientists incorporate DevOps practices for continuous building, testing and deployment of their apps. For these developers, many of whom are in startups that lack deep information technology resources, it’s essential to have a fully managed cloud-based development environment in which the service provider automatically handles cluster configuration, optimization, security and workload management.

In that regard, Wikibon lauds the announcement of Databricks Serverless as a DevOps foundation for enterprise developers. However, it’s not clear how the new offering will fare in a market where Microsoft Corp., IBM Corp. and others already provide robust multitenant cloud-based Spark DevOps environments.

Pushing Spark into true streaming territory: Spark is not the only open-source environment for streaming analytics. Spark’s Structured Streaming, which has been on the market since v2.0 last year, is facing stiffer competition in the streaming arena from established alternatives such as Flink and Beam. Heretofore, a key competitive disadvantage for Structured Streaming has been its microbatching architecture, which limits its ability to push performance in continuous computation and low-latency analytics.

To keep Structured Streaming in the competitive race, Databricks has had to continue boosting its performance, such as through the latest announcement of a high-level API that it claims significantly boosts Spark’s Structured Streaming throughput on its cloud platform. That’s in addition to new code it’s contributing to Apache Spark that it claims lowers the service’s end-to-end latency to the sub-millisecond range. Databricks’ investments in reducing Spark’s streaming latency are table stakes for competing in this hotly contested segment.

Photo: Robert Hof
