UPDATED 08:00 EDT / OCTOBER 27 2020

BIG DATA

Dremio says its query engine eliminates the need for cloud data warehouses

Self-service analytics company Dremio Corp. today announced new technology that it claims can deliver sub-second query response times on cloud data lakes across thousands of concurrent users and queries.

The startup is also rolling out new integrations with Microsoft Corp.’s Power BI and Tableau Software Inc.’s Tableau visualization software that enable the tools to be launched via a live connection from within Dremio.

The new release enables production business intelligence workloads to be run directly on Amazon Web Services Inc.’s S3 and Microsoft Corp.’s Azure Data Lake Storage without requiring data to be pre-loaded into data warehouses, aggregation tables or extracts. That process, called extract/transform/load or ETL, can take weeks.

“The goal is to eliminate the need for a data warehouse,” said co-founder and Chief Product Officer Tomer Shiran (pictured, right, with co-founder Jacques Nadeau). “We can do what a data warehouse can do in the cloud without the need for ETL.” The company said the result is that multiple high-concurrency, low-latency SQL workloads such as BI dashboards can be run directly on a cloud data lake.

Cloud object storage was designed to be accessed remotely, Shiran said. “There are a variety of query engines, but none were designed for low-latency access,” he said. “They were designed for batch processing and because S3 is remote it has ‘noisy neighbor’ issues that can create performance problems.”

S3-specific caching

Dremio’s approach uses in-memory caching purpose-built for the S3 format. Rather than decompressing and de-serializing with each query, it pulls the data into memory for faster access. The software determines what data to keep in memory by continually analyzing what data is accessed most often. “Data is rarely accessed one time,” Shiran said. “Users tend to interact with the same dashboard and data.”
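The idea of keeping the most frequently accessed data in memory can be illustrated with a small frequency-based cache. This is only a sketch of the general technique, not Dremio's implementation: the `FrequencyCache` class, its `loader` callback and the block identifiers are all hypothetical names chosen for illustration.

```python
from collections import Counter


class FrequencyCache:
    """Hypothetical cache that keeps the most frequently requested blocks in memory."""

    def __init__(self, capacity, loader):
        self.capacity = capacity  # maximum number of decoded blocks held in memory
        self.loader = loader      # assumed callback: fetches and decodes one block
        self.blocks = {}          # block_id -> decoded data
        self.hits = Counter()     # block_id -> number of times requested

    def get(self, block_id):
        self.hits[block_id] += 1
        if block_id in self.blocks:
            # Fast path: already decoded and resident in memory.
            return self.blocks[block_id]
        # Slow path: remote read plus decompress/deserialize.
        data = self.loader(block_id)
        if len(self.blocks) >= self.capacity:
            # Evict the cached block that has been requested least often.
            coldest = min(self.blocks, key=lambda b: self.hits[b])
            del self.blocks[coldest]
        self.blocks[block_id] = data
        return data
```

In this sketch, a dashboard that repeatedly requests the same blocks pays the remote read and decode cost only the first time, which mirrors Shiran's point that data is rarely accessed just once.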

Dremio is based on Apache Arrow, an acceleration engine for analytics frameworks that uses columnar in-memory processing to speed performance by operating on all the values of a given field together instead of reading individual records into memory. Apache Arrow advocates say the technique can result in up to a 100-fold performance improvement.
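The difference between record-oriented and columnar layouts can be shown with a toy example. The sketch below uses only Python's standard library and does not use the Apache Arrow API; the sample sales data and the function names are invented for illustration.

```python
from array import array

# Row-oriented layout: every record keeps all of its fields together.
rows = [
    ("east", 3, 9.5),
    ("west", 5, 4.0),
    ("east", 2, 7.25),
]

# Column-oriented layout: each field is stored in its own contiguous sequence.
regions = [r[0] for r in rows]
units = array("q", (r[1] for r in rows))   # 64-bit integers
prices = array("d", (r[2] for r in rows))  # 64-bit floats


def total_units_from_rows(records):
    # Must visit every whole record even though only one field is needed.
    return sum(record[1] for record in records)


def total_units_from_column(units_column):
    # Reads a single contiguous column and ignores the other fields entirely.
    return sum(units_column)
```

Both functions return the same total, but the columnar version reads only the field the query needs. That access pattern is what columnar engines are built to exploit at much larger scale.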

The new version can now cache data reflections, which are physically optimized representations of data, in the Apache Arrow format for direct loading into memory, thereby eliminating the need to decode and decompress at runtime. Support for multiple coordinator nodes enables workloads made up of thousands of simultaneous users and queries to run quickly.

Dremio has also added runtime intelligence from dimension tables to reduce the amount of data that must be read from a fact table, which is at the center of the star schema logical data structure that is common to data warehouses. That speeds up performance by more than a factor of 100, the company claimed.

Runtime filtering interprets the fact table at runtime based on the minimum amount of data that’s required to execute the query. “Data warehouses use similar techniques but you need to load the data into the warehouse,” Shiran said. “To do that from the object store is much harder because we haven’t seen the data yet.”
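A simplified version of runtime filtering in a star schema might look like the following. It is a sketch of the general technique rather than Dremio's code, and it assumes the fact table is stored in chunks that each carry minimum and maximum key values. The table contents, field names and functions are hypothetical.

```python
# Dimension table: small, so it is cheap to scan first.
stores = [
    {"store_id": 1, "region": "east"},
    {"store_id": 2, "region": "east"},
    {"store_id": 7, "region": "west"},
    {"store_id": 8, "region": "west"},
]

# Fact table: large, stored in chunks with min/max statistics on the join key.
fact_chunks = [
    {"min_id": 1, "max_id": 2, "rows": [(1, 10), (2, 20)]},
    {"min_id": 3, "max_id": 6, "rows": [(3, 5), (6, 5)]},
    {"min_id": 7, "max_id": 8, "rows": [(7, 30), (8, 40)]},
]


def matching_keys(dim_rows, predicate):
    """Scan the dimension table and collect the join keys that pass the filter."""
    return {row["store_id"] for row in dim_rows if predicate(row)}


def scan_fact(chunks, keys):
    """Sum sales for the given keys, skipping chunks that cannot contain them."""
    if not keys:
        return 0, 0
    lo, hi = min(keys), max(keys)
    total, chunks_read = 0, 0
    for chunk in chunks:
        if chunk["max_id"] < lo or chunk["min_id"] > hi:
            continue  # key range does not overlap: never read this chunk
        chunks_read += 1
        for store_id, amount in chunk["rows"]:
            if store_id in keys:
                total += amount
    return total, chunks_read


west_keys = matching_keys(stores, lambda row: row["region"] == "west")
total_sales, chunks_read = scan_fact(fact_chunks, west_keys)
```

Here the filter built from the dimension table lets the scan read one of the three fact chunks. The filter is only known once the query starts running, which is the difficulty Shiran describes when the data sits in an object store rather than a warehouse.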

Data lakes have historically been regarded as kind of untamed versions of data warehouses because they can store data in structured rows and columns as well as unstructured data like emails and word processing documents. However, Shiran said he believes data lakes are evolving toward use with very large amounts of structured and semi-structured data such as log files.

The new product features are available in community and enterprise versions in the AWS Marketplace.

Photo: Dremio
