UPDATED 16:56 EDT / AUGUST 27 2024

Tomer Shiran, Dremio Corp., at AWS re:Invent 2022

Dremio says it has dramatically improved query performance on Iceberg data lakes

Data lakehouse company Dremio Corp. today announced a set of advanced analytics performance capabilities that it says significantly speed query performance on Apache Iceberg tables while reducing the need for user intervention.

The two major new features are Live Reflections and Result Set Caching. Dremio Reflections are a feature of the company’s data lake engine that accelerates query performance by creating optimized, precomputed data representations. They’re similar in concept to materialized views but are more flexible and integrated with Dremio’s architecture. As a result, they enable faster and more interactive querying of large datasets stored in data lakes without data movement or duplication.

Live Reflections ensure that materialized views and aggregations are updated automatically whenever the underlying Iceberg tables change. Users can accelerate queries without maintenance overhead because the system recommends the Reflections that deliver the most value for system-wide performance.

“It used to be that you had to figure out which Reflections you wanted to create and then manage the refresh cycle,” said Dremio Founder Tomer Shiran (pictured). “You had to logically figure out what aggregations you needed, how to sort the table and how frequently to refresh. We’ve now solved both of those problems.”

Recommended Reflections monitor activity across the entire data lake to learn which queries are run most often and how they can be accelerated. Any update to a table automatically and incrementally refreshes all downstream Reflections, even when joins span multiple tables.
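To illustrate what an incremental refresh means in general terms, here is a minimal Python sketch of a precomputed aggregate that folds in only newly added rows instead of rescanning the base table. It is not Dremio's implementation: the class `IncrementalSum`, the method `refresh` and the integer `version` argument (standing in for a monotonically increasing table version) are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


class IncrementalSum:
    """Hypothetical precomputed SUM(amount) GROUP BY key, kept fresh incrementally."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)
        self.last_version = 0  # assumed: table versions increase monotonically

    def refresh(self, version: int, new_rows: Iterable[Tuple[str, float]]) -> None:
        """Fold in only the rows added by `version`; skip versions already applied."""
        if version <= self.last_version:
            return  # this change has already been reflected in the aggregate
        for key, amount in new_rows:
            self.totals[key] += amount
        self.last_version = version


agg = IncrementalSum()
agg.refresh(1, [("us", 10.0), ("eu", 5.0)])  # initial load
agg.refresh(2, [("us", 2.5)])                # only the new row is processed
agg.refresh(2, [("us", 2.5)])                # replayed change is ignored
```

The cost of each refresh in this sketch depends on the size of the change rather than the size of the table, which is the general reason incremental updates tend to be cheap.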

Shiran said Apache Iceberg’s embedded change-tracking features make this possible. “You can note that the version of this table that was used for this query is the same as the version currently being queried,” he said. “I don’t have to worry that something may have changed. I know with certainty that it won’t return a different result than what the user expects.”

Dremio claimed that Result Set Caching can accelerate query responses up to 28-fold across all data sources by storing the results of frequently run queries rather than just the queries themselves. “People often query the same data,” Shiran said. “The optimizer takes the query plan, and asks if it can use one of the existing Reflections. The user isn’t aware of it.”

Storing query results rather than just the queries consumes more storage, but “object storage is cheap,” Shiran said. “Compute is expensive.”
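One way to picture how table versions can keep cached results safe is a cache keyed on both the query text and the version of every table the query reads. The Python sketch below is an illustration under that assumption, not Dremio's API: `ResultCache`, `run`, the `snapshots` mapping and the `execute` callback are hypothetical, and the whitespace-only query normalization is deliberately naive.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical key: normalized SQL plus (table name, version) pairs.
CacheKey = Tuple[str, Tuple[Tuple[str, int], ...]]


@dataclass
class ResultCache:
    entries: Dict[CacheKey, List[tuple]] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def run(
        self,
        sql: str,
        snapshots: Dict[str, int],
        execute: Callable[[str], List[tuple]],
    ) -> List[tuple]:
        """Return cached rows only if every table is at the version seen before."""
        # Naive normalization: collapse runs of whitespace (also inside literals).
        normalized = " ".join(sql.split())
        key = (normalized, tuple(sorted(snapshots.items())))
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        rows = execute(sql)  # the expensive step: actually run the query
        self.entries[key] = rows
        return rows


calls: List[str] = []


def fake_execute(sql: str) -> List[tuple]:
    """Stand-in for a query engine; records each real execution."""
    calls.append(sql)
    return [(len(calls),)]


cache = ResultCache()
first = cache.run("SELECT count(*) FROM sales", {"sales": 1}, fake_execute)
again = cache.run("SELECT  count(*)  FROM sales", {"sales": 1}, fake_execute)  # served from cache
after_write = cache.run("SELECT count(*) FROM sales", {"sales": 2}, fake_execute)  # version changed, re-executed
```

Because a changed table version produces a different key, a stale result is never returned in this sketch. The trade-off is the one Shiran describes: extra storage for results in exchange for skipped computation.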

A new merge-on-read feature speeds Iceberg table writes and ingestion operations by up to 85%. Notification-based auto ingest keeps tables current by monitoring object storage for new files and ingesting them automatically when a notification is received.

“It’s all incremental and live, unlike in the past when you had to manually schedule an operation,” Shiran said. “Now you just insert the records automatically, and because all the updates are incremental, they’re cheap.”
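For readers unfamiliar with merge-on-read, the general idea is that a write leaves existing data files untouched and records small delete markers instead, with the reader reconciling the two at query time. The Python sketch below models that idea with position-style deletes. It is a simplified illustration rather than Dremio's or Iceberg's actual code, and `MergeOnReadTable` and its methods are hypothetical names.

```python
from typing import List, Set, Tuple

Row = Tuple[int, str]  # (id, value)


class MergeOnReadTable:
    """Toy table: immutable data files plus a set of deleted row positions."""

    def __init__(self) -> None:
        self.data_files: List[List[Row]] = []
        self.deletes: Set[Tuple[int, int]] = set()  # (file index, row position)

    def append(self, rows: List[Row]) -> None:
        """Write a new data file; existing files are never modified."""
        self.data_files.append(list(rows))

    def delete_where_id(self, row_id: int) -> None:
        """Record delete markers for live matching rows instead of rewriting files."""
        for file_idx, rows in enumerate(self.data_files):
            for pos, (rid, _) in enumerate(rows):
                if rid == row_id:
                    self.deletes.add((file_idx, pos))

    def update(self, row_id: int, value: str) -> None:
        """An update is a delete marker plus a small appended file."""
        self.delete_where_id(row_id)
        self.append([(row_id, value)])

    def scan(self) -> List[Row]:
        """Merge at read time: skip any row position marked as deleted."""
        return [
            row
            for file_idx, rows in enumerate(self.data_files)
            for pos, row in enumerate(rows)
            if (file_idx, pos) not in self.deletes
        ]


table = MergeOnReadTable()
table.append([(1, "a"), (2, "b")])
table.update(2, "B")  # the first data file is left as written
```

The write path in this sketch does only a small amount of work, while `scan` pays a modest cost to filter deleted rows, which is the usual trade-off of merge-on-read compared with rewriting whole files on every change.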

Photo: SiliconANGLE
