Databricks’ Delta Lake 3.0 bridges compatibility gaps with Apache Iceberg and Hudi
Databricks Inc. today released the latest version of Delta Lake, the storage framework that it donated to open source a year ago.
Version 3.0 adds support for the Apache Iceberg and Apache Hudi data lake platforms through a universal format that allows data stored in Delta Lake to be read as if it were in either of those formats. The move is intended to simplify the often complicated integration work required in building a lakehouse, an open, hybrid architecture that combines elements of both a data warehouse and a data lake.
The market for lakehouses is crowded and fast-growing. Although no lakehouse-specific forecasts could be found, SNS Insider Pvt Ltd. estimates that the data lake market was valued at just over $12 billion last year and is expected to grow more than 21% annually, to $57 billion by 2030. Databricks said Delta Lake is the most widely used lakehouse storage format in the world, with more than 1 billion downloads per year.
Metadata mismatch
Iceberg and Hudi are two of the most popular open-source lakehouse options. They and Delta Lake work with the Apache Parquet open-source format but “they all generate different metadata,” said Databricks Marketing Vice President Joel Minnick. “How you interact with that metadata affects the type of connectors in the engines that connect to those platforms. We could end up in a format war that slows down the adoption of the lakehouse because we’ve created different ecosystems.”
Delta Lake 3.0 can generate metadata automatically in all three formats and recognizes which format a connector expects. “By building for Delta Lake, you can build for every platform,” Minnick said.
Data stored in Delta Lake can now be read as if it were an Iceberg or Hudi table. Databricks’ UniForm universal format automatically generates the metadata needed for Iceberg or Hudi so manual conversion between the formats isn’t needed.
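In practice, UniForm is enabled as a table property at creation time. A minimal sketch, based on the table property Databricks documents for Delta Lake 3.0 (the table and column names here are illustrative):

```sql
-- Create a Delta table that also exposes Iceberg-readable metadata.
-- 'delta.universalFormat.enabledFormats' is the UniForm property;
-- table and column names are hypothetical examples.
CREATE TABLE sales_events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  amount     DECIMAL(10, 2)
)
USING DELTA
TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg');
```

An Iceberg-aware engine can then read the same underlying Parquet files through the generated Iceberg metadata; no copy or conversion job is involved.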
A component called Delta Kernel provides a single, stable application programming interface for connectors that bridge different data management engines. Connectors that are built against a core Delta library and that implement Delta specifications don’t need to be updated with each new version or protocol change, the company said.
A new layout called Liquid Clustering provides cost-efficient data clustering as data grows to help ensure that read and write performance requirements are met, Databricks said.
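Liquid Clustering replaces fixed, Hive-style partitioning with clustering keys declared on the table. A hedged sketch of the documented `CLUSTER BY` syntax (the table and column names are illustrative):

```sql
-- Declare clustering columns instead of a fixed partition scheme;
-- Delta incrementally reclusters data files as the table grows.
-- Table and column names are hypothetical examples.
CREATE TABLE sensor_readings (
  sensor_id  STRING,
  reading_ts TIMESTAMP,
  value      DOUBLE
)
USING DELTA
CLUSTER BY (sensor_id, reading_ts);
```

Unlike a fixed partition layout, the clustering keys can later be changed without rewriting the entire table, which is what keeps read and write performance predictable as data volumes grow.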
Delta Lake also supports Delta Sharing, an open protocol for secure data exchange that the company says is used by more than 6,000 data consumers.
Databricks is advocating for the Hudi and Iceberg communities to adopt its approach. “Customers use all of these different systems and they are asking for ways to make the translation between all these different systems much easier,” Minnick said. “By making the format effectively irrelevant, adoption of the lakehouse can be rapidly accelerated.”
Photo: Niklas Tschöpe/Wikimedia Commons