UPDATED 09:00 EDT / JUNE 09 2022

BIG DATA

Databricks adds data lineage feature to its catalog with support for nontraditional uses

Databricks Inc. today is adding data lineage features to its Unity Catalog governance platform, a move that it says significantly expands data governance capabilities on the hybrid of data warehouse and data lake that it calls a lakehouse.

Data lineage describes how data flows throughout an organization, giving customers the ability to see where lakehouse data came from, who created it and when, how it was modified over time and how it’s currently being used, among other details. The feature is now available for preview on the Amazon Web Services Inc. and Microsoft Corp. Azure clouds.

The feature helps organizations cope with the growing volume and variety of data coming in from multiple sources by tracking how it moves and changes, who has access to it and how it’s used. Databricks says it’s bringing an updated approach to the process and that adding the feature required modifying the core database engine to accommodate nonstandard use cases such as machine learning models.

“Understanding how data flows through the organization is fundamental to being able to trust your data,” said Joel Minnick, Databricks’ vice president of marketing. “We’re going back to the core principle of the Unity Catalog, which is not just trying to govern tables and files but also modern assets like dashboards, notebooks and models.”

Lifecycle view

Data lineage enables data management teams to see all downstream functions that are affected by data changes, including applications, dashboards, machine learning models and data sets, and to understand the severity of the impact so stakeholders can be notified. “The minute data comes into the lakehouse, we start to track it,” Minnick said. Metadata that travels with data elements, such as the author and creation date, is also imported.
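To make the idea of downstream impact concrete, here is a minimal sketch that treats lineage as a directed graph and walks it breadth-first from a changed data set. It is an illustration only, not Databricks' implementation or the Unity Catalog API: the asset names, the `LINEAGE` mapping and the `downstream_assets` helper are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
# The asset names are invented for illustration only.
LINEAGE = {
    "raw.orders": ["clean.orders"],
    "clean.orders": ["sales_dashboard", "churn_model", "agg.daily_sales"],
    "agg.daily_sales": ["finance_report"],
}


def downstream_assets(graph, source):
    """Return every asset reachable from `source`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # visit each asset once, even with shared parents
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Everything that could be affected by a change to raw.orders.
print(downstream_assets(LINEAGE, "raw.orders"))
# ['clean.orders', 'sales_dashboard', 'churn_model', 'agg.daily_sales', 'finance_report']
```

In this toy model, the owners of each returned asset would be the stakeholders to notify before a change ships.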

The feature also helps organizations meet compliance requirements through improved traceability, Databricks said. “We capture all the data we can see at a pretty fine-grained level of detail: who created it, what changes were made, when was it changed, what pipelines it was used in and who has access to it,” Minnick said. “Ultimately, if you share that data, we can also see who it is shared with.”

Data lineage enables data consumers such as data scientists, data engineers and data analysts to conduct context-aware analysis. Data stewards can see which data sets are no longer accessed or have become obsolete so stale or unnecessary data can be removed to improve overall data quality.

Key features of Unity Catalog include automated run-time lineage, which captures all lineage generated in Databricks and is more accurate and efficient than manual tagging. Information is captured for tables, views and columns to give a granular picture of upstream and downstream data flows. Additionally, lineage works across all languages supported by Databricks, including SQL, Python, R and Scala, as well as notebooks, workflows and dashboards.
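Column-level lineage can be pictured as a mapping from each derived column to the columns it was computed from. The sketch below traces one column back to all of its upstream sources. It is a toy model under stated assumptions, not the Unity Catalog data model: the column names, the `COLUMN_LINEAGE` mapping and the `upstream_columns` helper are hypothetical.

```python
# Hypothetical column-level lineage: target column -> columns it was derived from.
COLUMN_LINEAGE = {
    "agg.daily_sales.revenue": ["clean.orders.amount", "clean.orders.order_date"],
    "clean.orders.amount": ["raw.orders.amount_cents"],
    "clean.orders.order_date": ["raw.orders.created_at"],
}


def upstream_columns(lineage, column):
    """Return the set of all columns that `column` ultimately depends on."""
    found = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in found:  # skip columns already traced
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(upstream_columns(COLUMN_LINEAGE, "agg.daily_sales.revenue")))
# ['clean.orders.amount', 'clean.orders.order_date',
#  'raw.orders.amount_cents', 'raw.orders.created_at']
```

Tracking at the column level rather than the table level narrows an investigation: a problem in `revenue` points to four specific source columns instead of every column in two tables.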

Databricks aims to make the capability available across all the cloud platforms it supports, Minnick said.

Photo: Robert Hof/SiliconANGLE
