Databricks nudges closer to bridging the gap between data lakes and warehouses
Continuing its quest to make freeform data lakes a viable alternative to highly structured data warehouses, Databricks Inc. today debuted an engine that enables many workloads previously targeted at data warehouses to be executed on data lakes instead.
SQL Analytics is said to combine data warehousing performance with data lake economics to enable SQL queries of a data lake to perform up to nine times faster than they would on a data warehouse. It’s another building block in the construction of what the company calls a “lakehouse,” an architecture that combines both types of workloads without requiring extract databases to be created for warehousing purposes.
Data warehouses are highly structured databases that combine information from multiple sources in a single repository that can be queried to discover new relationships between data elements. Data lakes are centralized repositories that combine structured, unstructured and semistructured data and are commonly used for machine learning and data science applications.
Conventional wisdom holds that the two architectures are fundamentally incompatible, but Databricks believes it can find common ground.
“This is about providing not just a first-class data science platform for machine learning and data science but also for queries in a way that is highly performant, low latency and at high user concurrency,” said Joel Minnick, Databricks’ vice president of marketing. “We believe the data lake is the center of gravity because it’s so good at handling the unstructured information that data science and machine learning innovation comes from. Data warehouses weren’t built for that.”
A lakehouse supports both kinds of workloads with a single architecture. SQL Analytics is built on Delta Lake, a table storage layer created by Databricks and released to open source a year ago. It provides some of the data reliability and quality features that data lakes typically lack.
“Delta Lake provides reliability by making data lakes operate in a transactional way,” Minnick said. It does that by adding a transaction log to the data lake that supersedes the data itself.
“So now I’m querying transaction logs to get the single source of truth regardless of the data in the lake itself,” Minnick said. “By running SQL workloads directly on the data lake, I can substantially reduce the number of ETL [extract/transform/load] pipelines I have to maintain.” That translates into fewer copies of data and less risk of conflict.
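To illustrate how a transaction log can act as the source of truth, here is a minimal sketch in Python. It is a toy model, not Delta Lake's implementation: the `live_files` function, the tuple-based action format and the file names are all hypothetical, loosely modeled on the idea of commits that add and remove data files.

```python
def live_files(log):
    """Replay an ordered list of commits and return the files in the current table version.

    Each commit is a list of (kind, path) actions, where kind is "add" or "remove".
    This format is a simplified assumption made for illustration only.
    """
    files = set()
    for commit in log:
        for kind, path in commit:
            if kind == "add":
                files.add(path)
            elif kind == "remove":
                # The file may still sit in storage, but readers no longer see it.
                files.discard(path)
            else:
                raise ValueError(f"unknown action: {kind}")
    return files


# Hypothetical history: the second commit rewrites part-0 into part-2.
log = [
    [("add", "part-0.parquet"), ("add", "part-1.parquet")],
    [("remove", "part-0.parquet"), ("add", "part-2.parquet")],
]
current = live_files(log)
```

In this sketch a reader consults only the replayed log, so a leftover `part-0.parquet` in storage never appears in query results. That is the sense in which the log, rather than the raw contents of the lake, defines the table.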
Data quality matters
Minnick said SQL Analytics does not eliminate the need to massage data for consistency or to conduct ETL if moving the data elsewhere. “If the data’s messy, then the data’s messy, but not having to push it around is an advantage for a lot of our customers,” he said. “By doing the transformations on the data lake, everyone is using the same data set and there’s one source of truth.”
Although various tools have long enabled SQL queries to be performed on data lakes, performance has typically been a downside. Databricks said it has come up with two ways to improve responsiveness. The first is by creating auto-scaling endpoints that keep query latency consistently low under high user load. The second is Delta Engine, which it said can complete queries quickly against data sets of any size.
“With Delta Engine we were able to solve the throughput issue,” Minnick said. “With SQL Analytics, customers can create SQL-tuned clusters that stand up or spin down based on the number of users querying that data lake.” That means customers can get the concurrency benefits of a data warehouse without leaving the data lake environment.
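The scaling behavior Minnick describes can be sketched as a simple sizing rule. This is a rough illustration only: the `clusters_needed` function, its parameters and the default limits are assumptions made for the example, not Databricks' actual scaling policy.

```python
import math


def clusters_needed(concurrent_queries, queries_per_cluster=10,
                    min_clusters=1, max_clusters=8):
    """Return how many clusters a hypothetical endpoint would run for a given load.

    All parameters are illustrative assumptions, not product settings.
    """
    if queries_per_cluster <= 0:
        raise ValueError("queries_per_cluster must be positive")
    if concurrent_queries < 0:
        raise ValueError("concurrent_queries must not be negative")
    # Enough clusters to cover the load, rounded up.
    wanted = math.ceil(concurrent_queries / queries_per_cluster)
    # Clamp to the configured floor and ceiling.
    return max(min_clusters, min(max_clusters, wanted))
```

Under these assumed defaults, a load of 25 concurrent queries would call for three clusters, and the count would fall back to the minimum as users log off.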
Databricks said SQL Analytics doesn’t obviate the need for a data warehouse but can handle most warehouse-like applications that don’t require writing updates or driving operational processes. “Right now we’re focusing mainly on business intelligence analytics and reporting,” rather than write-intensive processes, Minnick said.
Databricks, which is privately held, said it achieved a greater than $350 million revenue run rate in the third quarter of 2020, up from $200 million in the same quarter last year.
SQL Analytics will be available for public preview on Nov. 18.
Photo: Pok_Rie/Pixabay