UPDATED 12:00 EST / APRIL 24 2019

BIG DATA

Databricks wants to replace messy data lakes with more reliable ‘Delta Lake’

Big data firm Databricks Inc. wants to clean up companies’ messy data lakes with a new open-source project.

Delta Lake, as the project is called, acts similarly to a regular data lake but provides greater reliability by ensuring all of the information stored within it is “clean” and without errors, Databricks said.

Data lakes are systems or repositories of data stored in its natural format, usually object “blobs” or files. They usually act as a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning.

But Databricks said the information stored in traditional data lakes can be unreliable or inaccurate for several reasons. These include failed writes, schema mismatches and data inconsistencies, which arise when batch and streaming data are mixed together.

“For the last decade, organizations have been building data lakes but have been failing to gain insights from the data,” Databricks Chief Executive Officer Ali Ghodsi told SiliconANGLE. “Because it is garbage in – garbage out, organizations run into issues with data quality, scalability and performance.”

This unreliable data can prevent companies from deriving business insights in a timely fashion, and also slows down initiatives such as machine learning model training, which require accurate and consistent data, the company said.

“Delta Lake addresses these challenges by ‘filtering’ the messy data and blocking access into the Delta Lake,” Ghodsi added. “The clean data sits in a Delta Lake on top of the data lake. This level of data reliability is not offered in today’s data lakes.”

Delta Lake ensures data is kept accurate and reliable because it manages transactions across batch and streaming data, as well as multiple simultaneous writes. Companies using Apache Spark to analyze their data can tap into Delta Lakes as their main information source, so they don’t need to make changes to their data architectures. In addition, Delta Lakes do away with the need to build complicated data pipelines necessary to move information across different computing systems. All of a company’s information can be stored in a Delta Lake, and hundreds of applications can tap into it as necessary.
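To make the idea of managing multiple simultaneous writes concrete, here is a minimal sketch of one common approach: an append-only commit log with optimistic concurrency, where a writer's commit is rejected if someone else committed first. The names `ToyCommitLog`, `CommitConflict` and `write_with_retry` are hypothetical and exist only for this illustration; this is not Databricks' code and the article does not describe Delta Lake's internals.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed after this writer read the table."""


class ToyCommitLog:
    """Hypothetical in-memory commit log; commit number i is table version i."""

    def __init__(self):
        self._commits = []
        self._lock = threading.Lock()

    def latest_version(self):
        # -1 means the table has no commits yet.
        with self._lock:
            return len(self._commits) - 1

    def try_commit(self, read_version, rows):
        # Accept the commit only if nobody else has committed since
        # this writer looked at the table (optimistic concurrency).
        with self._lock:
            if read_version != len(self._commits) - 1:
                raise CommitConflict("table changed since version %d" % read_version)
            self._commits.append(list(rows))
            return len(self._commits) - 1

    def snapshot(self):
        # Readers only ever see whole commits, never a half-finished write.
        with self._lock:
            return [row for commit in self._commits for row in commit]


def write_with_retry(log, rows, max_attempts=5):
    """Re-read the latest version and retry when a concurrent writer wins."""
    for _ in range(max_attempts):
        version = log.latest_version()
        try:
            return log.try_commit(version, rows)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after %d attempts" % max_attempts)
```

In this sketch a batch job and a streaming job would both call `write_with_retry`. Whichever commits second simply retries against the newer version, so the log stays a single ordered history.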

Delta Lakes also make life easier for individual developers. With a Delta Lake set up, developers can access it from their laptops and quickly build a data pipeline to whatever app they’re working on. They can also access earlier versions of each Delta Lake to audit changes, roll back mistakes or reproduce the results of their machine learning experiments. In addition, developers can convert existing Parquet files, a commonly used format for storing large datasets, into a Delta Lake, which avoids the need for heavy reading and rewriting of that data.
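The ability to read earlier versions can be pictured with a small sketch. It assumes a table is stored as an ordered list of commits, so that "version N" means "everything up to and including commit N". `ToyVersionedTable` and its methods are hypothetical names used only for this example, not part of any Delta Lake API.

```python
class ToyVersionedTable:
    """Hypothetical table that keeps every commit so old versions stay readable."""

    def __init__(self):
        self._commits = []

    def append(self, rows):
        # Each append becomes a new immutable commit and a new version number.
        self._commits.append(tuple(rows))
        return len(self._commits) - 1

    def read(self, version=None):
        # With no version given, read the latest state of the table.
        latest = len(self._commits) - 1
        if version is None:
            version = latest
        elif not 0 <= version <= latest:
            raise ValueError("unknown version: %r" % (version,))
        # Replay commits 0..version to rebuild the table as it was then.
        return [row for commit in self._commits[: version + 1] for row in commit]
```

Under these assumptions, reproducing an experiment means recording the version number a model was trained on and later calling `read` with that same number, regardless of what has been appended since.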

“Delta Lake should be used by developers looking to transform their raw, unreliable data into ready-to-use, reliable data for machine learning initiatives,” Ghodsi said. “Delta Lake will simplify data engineering and eliminate the reliability problems developers run into every day.”

Analyst James Kobielus of SiliconANGLE sister market research firm Wikibon said Delta Lake actually sounds indistinguishable from a data warehouse, which he defines as a “single version of truth” governed repository of cleansed data that’s used by downstream apps for operational business intelligence, reporting, predictive modeling and other workloads.

“In other words, it really sounds like Databricks is broadening its go-to-market focus to address a wider range of traditional enterprise use cases, such as data warehousing,” Kobielus said. “But Delta Lakes begs the obvious question: What can it do that’s not already supported in what’s probably the most widely adopted open-source data warehousing project, Apache Hive, apart from being able to analyze the data in the warehouse using Spark?”

Delta Lake is available now under the Apache 2.0 license.

Photo: Lars_Nissen_Photoart/Pixabay
