Iceberg and Trino highlight enterprise demand for cross-platform data lake functionality
An iceberg is a piece of ice that has broken off a glacier and floats freely in open water, an apt namesake for the database table format project that shares its name.
Open-source Apache Iceberg provides database-style table functionality on top of cloud object stores. It exemplifies how separating storage from compute in modern data stacks enables scalable, cost-effective computing and interoperability across query engines.
Two engineers at Netflix Inc. created Iceberg to overcome the challenges they encountered with existing data lake table formats such as Hive tables, accessed through engines such as Apache Spark and Apache Hive. The engineers needed a way to manage their employer’s massive volumes of streaming-media data stored in Amazon S3.
“We had the same problems that everybody else did, but 10 times worse,” recalled Ryan Blue, co-founder and chief executive officer of Tabular Technologies Inc. and creator of Apache Iceberg. “Every request to [Amazon’s] S3 was not seven milliseconds, it was 70 milliseconds … so all the things that you had to do really quickly to make sure your database doesn’t lie to you, we could no longer do really quickly. So, we had to solve this problem.”
Blue spoke with George Gilbert, senior analyst at theCUBE Research, in the latest episode of The Road to Intelligent Data Apps, theCUBE’s continuing conversation about the sixth data platform, an emerging framework in which the leading vendors are Databricks Inc., Snowflake Inc., Amazon Web Services Inc., Microsoft Azure and Google LLC. He was joined by Dain Sundstrom, chief technology officer of Starburst Data Inc. and co-creator of the Trino and Presto query engines. They discussed the evolution and significance of separating storage from compute in modern data stacks.
Iceberg applied database fundamentals in an object store world
The problem Blue and his colleagues were trying to solve was that data lake table formats such as Hive’s were tied to a primary engine and, in some cases, a single provider. By letting any compute engine interact with a common data foundation, Iceberg frees users to work with the analytics engine of their choice. Major tech players such as Netflix, Amazon Web Services, Snowflake and Databricks have widely adopted the open-source format.
“We had to go and look at Hive tables and say, ‘You know what? That model of keeping track of what’s in our table is too simplistic; it’s not going to work in a world based on object stores,’” Blue said. “What if we applied database fundamentals? We designed for the constraints we were working with.”
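Blue’s point about replacing Hive’s directory-based table tracking with database-style metadata can be illustrated with a minimal sketch using the PyIceberg library. The catalog endpoint and the table name analytics.events below are hypothetical placeholders, not systems described in the interview; the sketch simply shows an engine resolving a table from versioned catalog metadata rather than by listing object-store paths.

```python
# Minimal PyIceberg sketch, assuming a REST catalog at a hypothetical
# endpoint and an existing table named "analytics.events". All names
# here are illustrative, not taken from the interview.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",  # hypothetical endpoint
    },
)

table = catalog.load_table("analytics.events")

# Table state comes from versioned metadata files tracked by the catalog,
# not from listing object-store prefixes, so any engine reading the same
# metadata sees the same consistent snapshot.
print(table.metadata.current_snapshot_id)

# Materialize a scan as an Arrow table for local inspection.
arrow_table = table.scan().to_arrow()
print(arrow_table.num_rows)
```

Because the table’s state lives in that shared metadata layer, Spark, Trino, Snowflake or a small Python script can all read and write the same table without coordinating through a single engine.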
For Sundstrom, the impetus for the open-source Trino distributed query engine came from the need to replace Facebook Inc.’s 300-petabyte Hive data warehouse. The goal was to enable fast, ad-hoc analytics queries over big data file systems.
“[Hive] was a great way to have less super skilled engineers be able to interact with the massive data sets Facebook had,” Sundstrom said. “The problem was it sucked. So, we came in to build a much more powerful distributed system using traditional database techniques.”
As the commercial developer of a distribution of the Trino query engine, Starburst has taken steps in recent months to make it easier for organizations to build applications on top of data lake architectures. In November, the company released a set of new features that provide unified data ingestion, governance and sharing on a single platform.
“Typically, you get into this problem of there’s just too much data to reasonably process; the queries are too big, and [customers] want to move to a more cost-effective solution,” Sundstrom said. “Often people start off with Starburst by just hooking it up and exploring the data in their existing platform because Trino supports federation.”
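The federation Sundstrom describes can be sketched with the Trino Python client. The coordinator host, catalogs and table names below are hypothetical placeholders rather than anything discussed in the episode; the point is that one SQL statement can span an existing Hive data lake and an operational database.

```python
# Hedged sketch of Trino federation via the trino Python client.
# Host, catalogs ("hive", "postgresql") and tables are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# A single query joins a Hive table in the data lake with a table in an
# operational Postgres database; Trino delegates work to each connector.
cur.execute("""
    SELECT o.region, count(*) AS orders
    FROM hive.sales.orders o
    JOIN postgresql.public.customers c
      ON o.customer_id = c.id
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```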
Starburst releases Icehouse data lake for data ingestion
In April, Starburst announced that it would release a fully managed Icehouse data lake on its cloud. Icehouse combines Trino and Iceberg storage to support near-real-time data ingestion into managed Iceberg tables at petabyte scale.
“You can explore your data [and] you can play with it,” Sundstrom said. “When you want to hit optimal performance, you export it. We recommend you export to Iceberg. Of everything out there today, it is the best in terms of data lake formats, in my opinion.”
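As a rough illustration of “exporting to Iceberg,” the hedged sketch below uses the same Trino client to copy a Hive table into an Iceberg table and then append new rows. The catalog, schema and table names are hypothetical, and this shows the general pattern, not Starburst’s managed Icehouse pipeline itself.

```python
# Hedged sketch: copy a Hive table into an Iceberg table with Trino CTAS.
# Catalog/schema/table names are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="analyst",
    catalog="iceberg", schema="analytics",
)
cur = conn.cursor()

# Copy the Hive table's rows into a new Iceberg table; later writes land
# as new snapshots that readers pick up atomically.
cur.execute("""
    CREATE TABLE iceberg.analytics.events
    WITH (format = 'PARQUET')
    AS SELECT * FROM hive.raw.events
""")
cur.fetchall()  # drain results so the statement completes

# Ongoing ingestion then becomes repeated small appends into the table.
cur.execute("INSERT INTO iceberg.analytics.events SELECT * FROM hive.raw.events_staging")
cur.fetchall()
```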
Iceberg’s modern table format for analytics continues to attract enterprise interest. Google, Snowflake and Databricks have all announced support for Iceberg, according to Blue. In its earnings call at the end of February, Snowflake noted that customer adoption of Iceberg tables could create “revenue headwinds” for the firm.
“Iceberg has two things that the other formats lack in some respect,” Blue said. “One is a strong technical foundation. The other is that open community, where it’s owned and controlled by the Apache Software Foundation. We really wanted this project to be something that was a foundational layer in data architecture, and we knew we needed to have a neutral community, a spec, solve all the problems.”
Here is the complete conversation, part of The Road to Intelligent Data Apps series:
Image: Getty Images-Matthias Kulka