Iceberg and Trino highlight enterprise demand for cross-platform data lake functionality
An iceberg is a piece of ice that has broken off a glacier and floats freely in open water, an apt namesake for the database table format project that shares its name.
Open-source Apache Iceberg provides database-style table functionality on top of cloud object stores. It exemplifies how separating storage from compute in modern data stacks enables scalable, cost-effective computing and interoperability across query engines.
Two engineers at Netflix Inc. created Iceberg to overcome the challenges they encountered with existing data lake table formats such as Hive tables, accessed through engines such as Apache Spark and Apache Hive. The engineers needed a way to manage their employer’s massive volumes of streaming-media data stored in Amazon S3.
“We had the same problems that everybody else did, but 10 times worse,” recalled Ryan Blue, co-founder and chief executive officer of Tabular Technologies Inc. and creator of Apache Iceberg. “Every request to [Amazon’s] S3 was not seven milliseconds, it was 70 milliseconds … so all the things that you had to do really quickly to make sure your database doesn’t lie to you, we could no longer do really quickly. So, we had to solve this problem.”
Blue spoke with George Gilbert, senior analyst at theCUBE Research, in the latest episode of The Road to Intelligent Data Apps, theCUBE’s continuing conversation about the sixth data platform, an emerging framework in which the leading vendors are Databricks Inc., Snowflake Inc., Amazon Web Services Inc., Microsoft Azure and Google LLC. He was joined by Dain Sundstrom, chief technology officer of Starburst Data Inc. and co-creator of the Trino and Presto query engines. They discussed the evolution and significance of separating storage from compute in modern data stacks.
Iceberg applied database fundamentals in an object store world
The problem Blue and his colleagues were trying to solve was that data lake table formats such as Hive’s were tied to a primary engine and, in some cases, a single provider. By letting any compute engine interact with a common data foundation, Iceberg frees users to work with the analytics engine of their choice. Major tech players such as Netflix, Amazon Web Services, Snowflake and Databricks have widely adopted the open-source format.
“We had to go and look at Hive tables and say, ‘You know what? That model of keeping track of what’s in our table is too simplistic; it’s not going to work in a world based on object stores,’” Blue said. “What if we applied database fundamentals? We designed for the constraints we were working with.”
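Blue’s point about replacing Hive’s directory-based table tracking with database-style metadata can be illustrated with a minimal sketch using the PyIceberg library. The catalog endpoint and the table name analytics.events below are hypothetical placeholders, not systems described in the interview; the sketch simply shows an engine resolving a table from versioned catalog metadata rather than by listing object-store paths.

```python
# Minimal PyIceberg sketch, assuming a REST catalog at a hypothetical
# endpoint and an existing table named "analytics.events". All names
# here are illustrative, not taken from the interview.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",  # hypothetical endpoint
    },
)

table = catalog.load_table("analytics.events")

# Table state comes from versioned metadata files tracked by the catalog,
# not from listing object-store prefixes, so any engine reading the same
# metadata sees the same consistent snapshot.
print(table.metadata.current_snapshot_id)

# Materialize a scan as an Arrow table for local inspection.
arrow_table = table.scan().to_arrow()
print(arrow_table.num_rows)
```

Because the table’s state lives in that shared metadata layer, Spark, Trino, Snowflake or a small Python script can all read and write the same table without coordinating through a single engine.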
For Sundstrom, the impetus for the open-source Trino distributed query engine came from the need to replace Facebook Inc.’s 300-petabyte Hive data warehouse. The goal was to enable fast, ad-hoc analytics queries over big data file systems.
“[Hive] was a great way to have less super skilled engineers be able to interact with the massive data sets Facebook had,” Sundstrom said. “The problem was it sucked. So, we came in to build a much more powerful distributed system using traditional database techniques.”
As the commercial developer of a distribution of the Trino query engine, Starburst has taken steps in recent months to make it easier for organizations to build applications on top of data lake architectures. In November, the company released a set of new features that provide unified data ingestion, governance and sharing on a single platform.
“Typically, you get into this problem of there’s just too much data to reasonably process; the queries are too big, and [customers] want to move to a more cost-effective solution,” Sundstrom said. “Often people start off with Starburst by just hooking it up and exploring the data in their existing platform because Trino supports federation.”
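The federation Sundstrom describes can be sketched with the Trino Python client. The coordinator host, catalogs and table names below are hypothetical placeholders rather than anything discussed in the episode; the point is that one SQL statement can span an existing Hive data lake and an operational database.

```python
# Hedged sketch of Trino federation via the trino Python client.
# Host, catalogs ("hive", "postgresql") and tables are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# A single query joins a Hive table in the data lake with a table in an
# operational Postgres database; Trino delegates work to each connector.
cur.execute("""
    SELECT o.region, count(*) AS orders
    FROM hive.sales.orders o
    JOIN postgresql.public.customers c
      ON o.customer_id = c.id
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```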
Starburst releases Icehouse data lake for data ingestion
In April, Starburst announced that it would release a fully managed Icehouse data lake on its cloud. Icehouse combines Trino and Iceberg storage to support near-real-time data ingestion into managed Iceberg tables at petabyte scale.
“You can explore your data [and] you can play with it,” Sundstrom said. “When you want to hit optimal performance, you export it. We recommend you export to Iceberg. Of everything out there today, it is the best in terms of data lake formats, in my opinion.”
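As a rough illustration of “exporting to Iceberg,” the hedged sketch below uses the same Trino client to copy a Hive table into an Iceberg table and then append new rows. The catalog, schema and table names are hypothetical, and this shows the general pattern, not Starburst’s managed Icehouse pipeline itself.

```python
# Hedged sketch: copy a Hive table into an Iceberg table with Trino CTAS.
# Catalog/schema/table names are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="analyst",
    catalog="iceberg", schema="analytics",
)
cur = conn.cursor()

# Copy the Hive table's rows into a new Iceberg table; later writes land
# as new snapshots that readers pick up atomically.
cur.execute("""
    CREATE TABLE iceberg.analytics.events
    WITH (format = 'PARQUET')
    AS SELECT * FROM hive.raw.events
""")
cur.fetchall()  # drain results so the statement completes

# Ongoing ingestion then becomes repeated small appends into the table.
cur.execute("INSERT INTO iceberg.analytics.events SELECT * FROM hive.raw.events_staging")
cur.fetchall()
```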
Iceberg’s modern table format for analytics continues to attract enterprise interest. Google, Snowflake and Databricks have all announced support for Iceberg, according to Blue. In its earnings call at the end of February, Snowflake noted that customer adoption of Iceberg tables could create “revenue headwinds” for the firm.
“Iceberg has two things that the other formats lack in some respect,” Blue said. “One is a strong technical foundation. The other is that open community, where it’s owned and controlled by the Apache Software Foundation. We really wanted this project to be something that was a foundational layer in data architecture, and we knew we needed to have a neutral community, a spec, solve all the problems.”
Here is the complete conversation, part of The Road to Intelligent Data Apps series:
Image: Getty Images-Matthias Kulka