Analysis: At Spark Summit, Databricks pushes Apache Spark where it needs to go
Invented eight years ago and intensively commercialized over the past several years, Apache Spark has become a core power tool for data scientists and other developers working on sophisticated projects in machine learning, continuous stream computing and graph analytics. The open-source codebase’s worldwide user base now exceeds 225,000, and it’s expanding rapidly.
However, for all its success in gaining mainstream adoption, Spark has begun to lose its status as the “next big thing” in the data science community and in big-data analytics generally. Over the past year, that status has shifted to something that has been around for a very long time: artificial intelligence.
More specifically, the next big thing, as judged by how thoroughly it has permeated the popular imagination, is that sophisticated cousin of machine learning known as deep learning. Just as Spark had superseded Hadoop as the buzzy preoccupation of data scientists and big-data analytics professionals, it was beginning to seem as if Spark itself were in danger of being eclipsed in this new era of recurrent, convolutional and other multilayered neural networks that fly the banner of deep learning.
But this week’s announcements show that Spark is far from irrelevant in the era of AI and deep learning. Spark Summit 2017 in San Francisco Tuesday featured several important announcements by Databricks that point the way toward the community’s evolution into an integral component of the burgeoning AI/deep learning ecosystem. Beyond that, Databricks’ announcements show that it—the primary developer, committer and visionary vendor in the Spark arena—is actively engaged in growing the core codebase’s functionality, performance, scalability, manageability and ease of use and development.
Here are Databricks’ principal announcements and how they push Spark where it needs to go to stay relevant and deliver deeper value in open analytics ecosystems:
Pushing Spark more deeply into deep learning: Spark plays a significant, growing and occasionally unsung role in the deep learning revolution. As used in environments such as those I discussed in this article, deep learning developers quickly train and deploy multilayered neural nets using libraries and compute clusters that are already at their disposal. At Spark Summit 2017, Databricks announced Deep Learning Pipelines, which provides an application programming interface that simplifies programmatic access to TensorFlow and Keras libraries from within Spark’s MLlib Pipelines.
One key innovation is Deep Learning Pipelines’ ability to expose functions through SQL, making them available to the broader analytics developer community. Wikibon hopes that Databricks will follow up quickly by adding API access to other leading open-source deep learning tools, especially Caffe2, Theano and MXNet. And we urge Databricks, as one of the prime Spark committers, to rapidly contribute this deep learning integration codebase to the Apache Spark community.
Pushing Spark more thoroughly into DevOps: Spark is the foundation of many innovative application development initiatives involving machine learning, continuous computation and graph analysis. More Spark applications are being developed for mission-critical enterprise apps, which requires that data scientists incorporate DevOps practices for continuous building, testing and deployment of their apps. For these developers–many of whom are in startups that lack deep information technology resources–it’s absolutely essential to have a fully managed cloud-based development environment in which the service provider automatically handles cluster configuration, optimization, security and workload management.
In that regard, Wikibon lauds the announcement of Databricks Serverless as a DevOps foundation for enterprise developers. However, it’s not clear how the new offering will fare in a market where Microsoft Corp., IBM Corp. and others already provide robust multitenant cloud-based Spark DevOps environments.
Pushing Spark into true streaming territory: Spark is not the only open-source environment for streaming analytics. Structured Streaming, which shipped with Spark 2.0 last year, faces stiff competition from established alternatives such as Apache Flink and Apache Beam. Heretofore, a key competitive disadvantage for Structured Streaming has been its microbatching architecture, which limits how far it can push performance in continuous computation and low-latency analytics.
To keep Structured Streaming in the competitive race, Databricks has had to continue boosting its performance, most recently by announcing a high-level API that it claims significantly boosts Structured Streaming throughput on its cloud platform. That’s in addition to new code it’s contributing to Apache Spark that it claims lowers end-to-end latency to the sub-millisecond range. Databricks’ investments in reducing Spark’s streaming latency are table stakes in this hotly competitive segment.
Photo: Robert Hof