UPDATED 17:10 EST / JUNE 14 2022

BIG DATA

Data Power Panel: How lakehouses aim to be the modern data analytics platform

A new generation of object store-based data lakes is rising to prominence, driven by the move to cloud computing and 2021’s record-breaking venture capital investment. The rise is marked by the emergence of three trends.

The first of these is the combination of data lakes and data warehouses into a lakehouse. This new category links data engineering, data science and data warehouse workloads on a single shared data platform and is a potential contender for the data platform of the future, according to theCUBE analyst team.

The second is that query engines and broader data fabric virtualization platforms are using modern data lakes as platforms for SQL-centric business intelligence workloads. This reduces or potentially eliminates the need for separate data warehouses.

The third trend is the rise in popularity of data fabric or data mesh architectures. This is driven by companies that have adopted data lakes as fundamental to their data strategy but are also keeping their traditional data warehouse estate.

These trends and other emerging data strategy options and their associated tradeoffs were the subject of a recent data power panel on “How Lakehouses Aim to be the Modern Data Analytics Platform,” an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio.

“A battle royale is brewing between cloud data warehouses and cloud lakehouses,” theCUBE industry analyst Dave Vellante said as he kicked off the in-depth panel discussion. “Is it possible to do it all with one cloud center analytical data platform?”

Joining Vellante were independent data experts Sanjeev Mohan (pictured, left), principal at SanjMo; Tony Baer (pictured, center), principal at dbInsight LLC; and Doug Henschen (pictured, right), vice president and principal analyst at Constellation Research Inc.

The evolution of lakehouse

The concept of a single platform to address business intelligence, data science and data engineering goes back to 2012, when Cloudera Inc. introduced the Apache Impala database on top of Hadoop, according to Henschen.

“Later in that decade with the shift to cloud and object storage, you saw the vendor shift to this whole cloud and object storage idea,” he said.

While the term came to prominence through a 2020 engineering blog published by Databricks Inc., “the concept of lakehouse was going on a long time ago, long before the term was invented,” according to Mohan, who gave the example of Uber Inc. attempting to gain transactional capabilities that its existing Hadoop framework didn’t have by adding SQL extensions.

“They weren’t calling it the lakehouse. They were using multiple technologies, but now they’re able to collapse it into a single data store that we call lakehouse,” Mohan said. “Data lakes are excellent at batch processing large volumes of data, but they don’t have the real-time capabilities such as change data capture, doing inserts and updates. So this is why lakehouse has become so important — because they give us these transactional capabilities.”
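The transactional gap Mohan describes can be made concrete. The following toy sketch (plain Python, not any vendor's API) imitates the MERGE/upsert semantics that lakehouse table formats such as Delta Lake, Iceberg and Hudi layer on top of append-only object storage — applying a change-data-capture-style feed as in-place updates and inserts rather than full file rewrites:

```python
# Illustrative sketch only: a classic data lake is append-only, so applying
# a change feed means rewriting whole files. Lakehouse table formats add a
# transactional MERGE/upsert; this in-memory toy imitates those semantics.

def merge_upsert(table, updates, key="id"):
    """Apply a batch of change records to a table keyed by `key`.

    Rows whose key already exists are updated in place; new keys are
    inserted -- the behavior of SQL's MERGE that plain object-store
    data lakes historically lacked.
    """
    indexed = {row[key]: dict(row) for row in table}
    for change in updates:
        existing = indexed.get(change[key])
        if existing is not None:
            existing.update(change)              # UPDATE branch
        else:
            indexed[change[key]] = dict(change)  # INSERT branch
    return list(indexed.values())

# A small table plus a CDC-style change feed: one update, one insert.
rides = [{"id": 1, "fare": 12.5}, {"id": 2, "fare": 8.0}]
changes = [{"id": 2, "fare": 9.5}, {"id": 3, "fare": 20.0}]
merged = merge_upsert(rides, changes)
```

In a real lakehouse this merge also carries ACID guarantees — concurrent readers see either the old snapshot or the new one, never a half-applied batch — which is what makes the pattern safe at data-lake scale.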

The evolution of the data lakehouse is a continuum of platforms that gradually blend one into another, Baer explained. Rather than a one-off defining moment, it started with SQL on Hadoop, then data warehouses reaching out to the Hadoop Distributed File System, and then the silos breaking down even further with cloud and cloud-native.

But the basic premise is “an attempt by the data lake folks to make the data lake friendlier territory to the SQL folks, and also to get into friendly territory [for] all the data stewards who are basically concerned about the sprawl and the lack of control in governance in the data lake,” Baer added.

Why lakehouse needs to reach maturity

Going deeper into the concept of the lakehouse: Is the term mostly marketing hype, or are real-world practical examples currently driving business results? In other words: Is lakehouse a mature concept?

The response from the Power Panel analysts was a resounding no.

“Even though the idea of blending platforms has been going on for well over a decade, I would say that the current iteration is still fairly immature,” Baer said. “We’re still very early in the game in terms of maturity of data lakehouses.”

A prime example is Databricks Inc., which wants customers to believe that its lakehouse platform is a natural extension of the data lake, according to Baer.

“Databricks had to go outside its core technology of Spark to make the lakehouse possible,” he said. Databricks SQL is not Spark SQL, Baer added. Instead, it’s SQL that has been adapted to run in a Spark environment, with the underlying engine based on C++.

There are two problems: The lakehouse struggles with handling metadata and there is a lack of standardization and interoperability between solutions, according to Henschen.

“All these open-source vendors, they’re running what I call ego projects,” he said, describing how he sees the battles playing out on social media. However, the end user just wants their problem solved with whatever works.

“They want to use Trino, Dremio, Spark on EMR, Databricks, Ahana, DaaS, Flink, Athena,” Henschen said, naming off an assorted list that included open-source analytics projects, lakehouse vendors and data solutions.

What lies ahead for the data analytics market?

The end goal of any data analytics platform is to provide consistency and scalability, with end users wanting an open performance standard, according to Henschen. But the market is going to have to find a solution that meets the needs of both the traditional SQL database camp and the data lake supporters. It isn’t an easy problem to solve.

“The SQL folks are from Venus, and the data scientists are from Mars. It really comes down to that type of perception,” Baer said.

The analysts discuss in detail where the market is headed. Topics include the development of a semantic layer to link the two worlds, the controversial possibility of a data mesh takeover, and the long-term prospects for open-source projects such as Apache Iceberg and Hudi. Throughout the conversation, the analysts provide vendor and product examples to illustrate their points. We’re “naming names,” Vellante said.

“The problem in our space is that there are way too many companies, there’s way too much noise. We are expecting the end users to parse it out or we expect analyst firms to boil it down,” Mohan said. “At the end of the day, the end user will decide what is the right platform, but we are going to have multiple formats living with us for a long time.”

Here’s the complete Data Power Panel conversation:

Photo: SiliconANGLE
