UPDATED 14:38 EDT / JUNE 05 2018

Databricks goes well beyond Spark into complex, multicloud AI pipelines

Apache Spark was the pinnacle of advanced analytics just a few years ago. As the primary developer of this technology, Databricks Inc. has played a key role in its commercial adoption, in the evolution of the community’s underlying open-source codebase, and in pushing Spark-based machine learning and streaming into the mainstream of enterprise computing.

However, as TensorFlow and other deep learning and artificial intelligence technologies have risen to the top of the open analytics stack, Apache Spark has begun to seem as if it is losing steam in terms of innovation and adoption. This trend was starting to make the core Databricks ML technology, Apache Spark’s MLlib, seem a bit behind the times, though in fact its adoption has continued to grow among working data scientists.

Nevertheless, Databricks rose to the challenge and last year, at what was then called Spark Summit, made several important announcements that extended its Spark-based Unified Analytics Platform to stay relevant in this new era:

  • Greater DL support in Spark-based ML pipelines: Databricks launched Deep Learning Pipelines, which provides an application programming interface for simplified programmatic access to TensorFlow and Keras libraries from within Spark’s MLlib Pipelines. This tool exposes ML functions through SQL to make them available to the broader analytics developer community. Within Databricks Workspace, this empowers data scientists and data engineers to collaborate around a common platform that is tightly integrated with Spark clusters.
  • Fully managed Spark-based ML pipeline in a serverless cloud: Databricks launched a fully managed cloud-based ML-development environment in which the service provider automatically handles cluster configuration, optimization, security and workload management. Databricks Serverless provides a managed DevOps foundation for enterprise ML developers. Within the Databricks Cloud Service, this automates and simplifies DevOps by auto-configuring and auto-scaling clusters, which abstracts away the complexity of the data infrastructure. It also provides enterprise-grade security and compliance.
  • Higher-performance structured streaming in a Spark-based ML pipeline: Databricks introduced a high-level API that boosts the throughput of Spark’s Structured Streaming on the vendor’s cloud platform. It also contributed code to Apache Spark that lowers Structured Streaming’s end-to-end latency into the sub-millisecond range.

Now let’s cut ahead to the present, in which AI’s marketplace momentum has picked up and more companies have introduced strong end-to-end DevOps tools for cloud-based DL. Already in the past year, we’ve seen powerhouse providers such as Amazon Web Services Inc. and Oracle Corp. get into this segment, while a growing number of startups have come to market with strong ML-pipeline automation tools. Today, at what’s now called Spark + AI Summit, Databricks announced new data-science DevOps capabilities that address the need for tools that can encompass end-to-end workflows across increasingly complex AI pipelines and multicloud environments.

Extending but going well beyond its core Spark offerings, Databricks has introduced the following new AI pipeline capabilities into its Unified Analytics Platform:

  • Scalable data preparation for AI modeling: The new Databricks Delta, which will be available by the end of this month, simplifies large-scale data engineering and ensures reliable data management through built-in transactional integrity on batch and streaming data. Databricks claims that Delta, which is now part of the Spark-based Databricks Runtime, improves the performance of extract/transform/load operations and other data preparation by up to 100 times through caching and indexing capabilities, speeding data scientists’ access to reliable datasets at scale.
  • Simplified AI modeling and training within popular frameworks: The new Databricks Runtime for ML provides preconfigured modeling and distributed training environments. These are tightly integrated with such popular AI frameworks and libraries as TensorFlow, Scikit-Learn, Keras and XGBoost, thereby accelerating environment provisioning and configuration management. The new HorovodEstimator is an MLlib-style application programming interface for distributed, multi-GPU DL training on Spark DataFrames, which streamlines an end-to-end DL pipeline from Spark-based ETL for data preparation all the way through to TensorFlow-based model training. A unified engine in Runtime for ML supports simplified access to GPUs in the AWS and Microsoft Azure clouds for parallelized ML/DL training, evaluation and deployment. And it enables unified provisioning of a Databricks cluster with the many libraries required for distributed ML/DL training, thereby decreasing cluster startup time.
  • Agile operationalization of AI into multiclouds: The new Databricks MLflow, currently in alpha, provides an open-source toolkit for simplifying ML modeling, training and operationalization across multiclouds. It integrates closely with Apache Spark, Scikit-Learn, TensorFlow and other open-source frameworks. As a DevOps platform for data scientists, it supports packaging of ML and DL models for repeatable and reproducible runs. It also supports execution and comparison of hundreds of model runs in parallel, can execute ML modeling and deployment on any hardware or software platform, and operationalizes models to diverse server platforms across various cloud environments. And it’s built around REST APIs and simple data formats that allow models to be invoked as lambda functions, even from other ML-driven apps, in serverless cloud fabrics.
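
To make the "execution and comparison of many model runs" idea concrete, here is a minimal sketch of run tracking built on simple data formats. It is not MLflow’s API: the RunTracker class, the train_and_score function and the one-JSON-file-per-run layout are all hypothetical, chosen only to show how recording each run’s parameters and metrics makes runs comparable and reproducible.

```python
import json
import tempfile
import uuid
from pathlib import Path


class RunTracker:
    """Hypothetical file-backed tracker (not MLflow): one JSON file per run."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def log_run(self, params, metrics):
        """Persist one run's parameters and metrics; return its generated ID."""
        run_id = uuid.uuid4().hex
        record = {"run_id": run_id, "params": params, "metrics": metrics}
        (self.root / f"{run_id}.json").write_text(json.dumps(record))
        return run_id

    def runs(self):
        """Load every recorded run from disk."""
        return [json.loads(p.read_text()) for p in self.root.glob("*.json")]

    def best_run(self, metric, higher_is_better=True):
        """Compare all runs on a single metric and return the winner."""
        chooser = max if higher_is_better else min
        return chooser(self.runs(), key=lambda r: r["metrics"][metric])


def train_and_score(learning_rate):
    """Stand-in for real training: a toy score that peaks at 0.1."""
    return 1.0 - abs(learning_rate - 0.1)


def run_experiment(root):
    """Record one run per candidate learning rate and return the tracker."""
    tracker = RunTracker(root)
    for lr in (0.01, 0.1, 0.5):
        tracker.log_run({"learning_rate": lr}, {"score": train_and_score(lr)})
    return tracker


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        best = run_experiment(tmp).best_run("score")
        print(best["params"], best["metrics"])
```

Because each run is a plain JSON record, any tool or service that can read files could compare or replay runs, which is the general appeal of the simple-formats approach described above.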

Impressive as these new offerings are, they don’t signal any clear differentiation for Databricks in a market that has become crowded with AI DevOps, modeling and training tools. On the low-latency side of big-data analytics, Kafka’s and Flink’s customer bases continue to expand, buoyed by the fact that they are true streaming fabrics that aren’t weighed down by Spark Structured Streaming’s legacy in microbatch architecture. And all of the major public-cloud service providers now offer rich AI DevOps tooling as well as sophisticated stream computing capabilities and serverless interfaces.

Spark is feeling increasingly like a legacy technology in many AI shops. In highlighting its pioneering role in the Spark market, Databricks takes the same risk that Cloudera Inc. and Hortonworks Inc. assume when they tout their Hadoop legacy. More of their customers have moved beyond those open-source technologies, though those codebases may hold onto some important use cases that aren’t disappearing any time soon.

Here’s an interview with Databricks Chief Technologist Matei Zaharia on theCUBE, SiliconANGLE’s livestreaming studio, at last year’s Spark Summit:

Photo: Robert Hof/SiliconANGLE
