Databricks releases data-sharing protocol to open source
Databricks Inc. is using its Data + AI Summit today to launch a new open-source project called Delta Sharing that provides an open protocol for securely sharing data across organizations in real time, regardless of the platform on which the data resides.
The company also announced reliability, governance and scalability enhancements to its “lakehouse,” the term it uses for an architecture that combines both data warehouse and data lake workloads without the need for extract databases. And it’s rolling out a new, unified data catalog that it claims makes it easier for organizations to discover and govern all of their data.
Delta Sharing is included within the Delta Lake project, a table storage layer the company released to open source in 2019. The project has already garnered support from a broad set of data providers, including Nasdaq Inc., Standard & Poor's Financial Services LLC, Amazon Web Services Inc., Microsoft Corp., Google LLC and Tableau Software Inc.
Automating a manual process
Databricks said it’s hoping to address the inefficiency of the often manual processes required for organizations to exchange data with customers, partners and suppliers. Data sharing products have historically been tied to a single vendor or commercial product, which limits collaboration between organizations that use different platforms.
“The primary way companies have shared with others is by going through a cumbersome process or by using a rigid existing system that everyone must use,” said Arsalan Tavakoli (pictured), co-founder and senior vice president of field engineering at Databricks.
Joining multiple data sources together is also a chore. “You can’t just give access to everyone,” he said. “You want access controls, auditing and version control. There is no way to do that today.”
Delta Sharing limits vendor lock-in and enables a broader and more diverse set of use cases than has been possible previously, the company said. It establishes a common standard for sharing all data types with an open protocol that can be used in SQL, visual analytics tools and programming languages such as Python and R. Delta Sharing also permits organizations to share existing large-scale datasets in the Apache Parquet and Delta Lake formats in real time without the need for copies.
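For recipients, consuming a share looks much like reading any other dataset. The sketch below, a minimal illustration rather than anything shown in the announcement, assumes the open-source delta-sharing Python connector; the profile file, share, schema and table names are placeholders.

```python
# Minimal sketch using the open-source delta-sharing Python connector
# (pip install delta-sharing). The profile file and the share, schema and
# table names below are placeholders, not taken from the announcement.
import delta_sharing

# The ".share" profile file holds the sharing server endpoint and a bearer
# token issued by the data provider.
profile = "config.share"

# Discover which tables the provider has exposed under this profile.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table into a pandas DataFrame. The data stays in Parquet/
# Delta format on the provider's storage and is served over the open
# protocol, so nothing has to be exported or copied first.
df = delta_sharing.load_as_pandas(f"{profile}#vendor_share.sales.transactions")
print(df.head())
```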
“You can share everything from a whole table to specific columns or rows to masked personally identifiable information data about a customer,” Tavakoli said. “The audit is the same.” The protocol also provides built-in security controls and permissions that ensure privacy and compliance needs can be met. Each partner can query, visualize and enrich that shared data with their tools of choice.
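On the provider side, the workflow Tavakoli describes could look roughly like the following hedged sketch, issued as Databricks SQL from a notebook where `spark` is already defined. The table, share, recipient and column names are illustrative and the exact DDL may vary by release; the point is that a masked subset of a table is published once and then granted to a recipient.

```python
# Hedged, illustrative provider-side sketch; all object names are placeholders
# and exact syntax may differ by Databricks release. Assumes a notebook where
# `spark` is available.

# Publish only selected columns, with PII masked, as a shareable table.
spark.sql("""
    CREATE TABLE customers_shared AS
    SELECT customer_id, region, sha2(email, 256) AS email_hash
    FROM customers
""")

# Create a named share and add the masked table to it.
spark.sql("CREATE SHARE IF NOT EXISTS partner_share")
spark.sql("ALTER SHARE partner_share ADD TABLE customers_shared")

# Register the receiving organization and grant it read access to the share;
# every access is then auditable on the provider side.
spark.sql("CREATE RECIPIENT IF NOT EXISTS acme_partner")
spark.sql("GRANT SELECT ON SHARE partner_share TO RECIPIENT acme_partner")
```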
Delta Sharing is the fifth major open-source project launched by Databricks, following Apache Spark, Delta Lake, MLflow for machine learning, and Koalas, which implements the pandas DataFrame application program interface on top of Spark. The project is being donated to the Linux Foundation.
Simplifying ETL
Lakehouse enhancements are led by Delta Live Tables, a cloud service based on the Databricks platform that simplifies extract/transform/load operations on both batch and streaming data, improving data quality and consistency. The product is intended to address what is today a mostly manual process of specifying granular instructions that define how data should be manipulated and tested.
“Delta Live Tables is for use when you have data coming in and you want to filter out the garbage,” Tavakoli said. “You don’t want to think about dependencies and restarting stalled pipelines. You just specify what you want and it will take care of everything underneath.”
Users specify the transformations they want using SQL commands, define quality expectations and constraints, and specify procedures to follow in the event of a failure. “The system can parse and figure out what stages need to run when,” Tavakoli said. “It builds a graph of all of those operations, kicks off the processes and sizes how big the job is going to be.” Delta Live Tables can also restart pipelines to resolve transient errors and provide information that helps data engineering teams pinpoint the source of errors.
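Although the quotes above center on SQL, Delta Live Tables also exposes a Python decorator API. The following is a minimal sketch of what a declarative pipeline definition could look like under that API; the table names, source path and quality expectation are illustrative, not drawn from the announcement.

```python
# Illustrative Delta Live Tables pipeline sketch using the Python `dlt` API.
# Table names, the source path and the expectation are placeholders; `spark`
# is provided by the pipeline runtime.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events loaded from cloud storage")
def raw_events():
    return spark.read.format("json").load("/mnt/landing/events/")

# Declare the cleaned table plus a quality expectation; rows that violate it
# are dropped, and the engine works out dependencies, ordering and retries.
@dlt.table(comment="Events that passed basic quality checks")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")
def clean_events():
    return dlt.read("raw_events").where(col("event_ts").isNotNull())
```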
Easier data sharing
Unity Catalog runs on top of Delta Sharing to enable organizations to manage secure data sharing with business partners and data exchanges. It’s based on ANSI SQL and integrates with existing data catalogs. Several Databricks partners have committed to contribute integrations based on the platform, the company said.
“It works with all frameworks,” Tavakoli said. “Spark is a core piece, but we go all the way from R and Python to tools like PyTorch,” the open-source machine learning framework.
Unity Catalog also works with standard notebooks as well as Databricks’ own notebooks. “It stores all of the runs so you can see the exact code run and the parameters,” he said. “It gives you all the output of the details from cluster configuration to the model to the parameters you use to drive it.”
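As a rough illustration of what ANSI SQL-based governance means in practice, the hedged sketch below grants a group read access to a table through a catalog hierarchy. The catalog, schema, table and group names are made up, and the exact privilege keywords may differ across releases.

```python
# Hedged sketch of catalog-level governance expressed as ANSI-style SQL GRANTs,
# issued from a Databricks notebook where `spark` is available. All object and
# group names are illustrative.
spark.sql("CREATE CATALOG IF NOT EXISTS finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.reporting")

# Let the analysts group see the catalog and schema and read one table;
# access decisions and audit logs are centralized in the catalog.
spark.sql("GRANT USE CATALOG ON CATALOG finance TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA finance.reporting TO `analysts`")
spark.sql("GRANT SELECT ON TABLE finance.reporting.revenue TO `analysts`")
```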
The company declined to specify pricing information.
Photo: SiliconANGLE