UPDATED 08:00 EDT / JULY 19 2017

BIG DATA

Dremio tackles self-service analytics with Apache Arrow-based data abstraction engine

After two years in stealth mode, Dremio Corp. is entering the red-hot self-service data analytics market today with an open-source platform built on Apache Arrow, an open-source in-memory columnar data format.

Dremio said its platform eliminates the need for cumbersome tasks and technologies such as extract/transform/load procedures, data warehouses, multi-dimensional cubes and aggregation tables, providing ease of use without sacrificing security and proper data governance.

The Mountain View, California-based company, which was founded by a group of big data veterans that includes Jacques Nadeau, one of the co-developers of Arrow, has raised more than $15 million for a technology that it claims can work with any business intelligence front-end or data science tool while eliminating the need for data movement, a time-consuming process that frustrates many big data initiatives.

“One thing I observed is that when we sold companies Hadoop we also had to sell them professional services,” said Chief Executive Tomer Shiran (pictured), who previously worked at MapR Technologies Inc. “It took months to get business value.”

Hardware efficiency

Apache Arrow is designed to enable high levels of hardware efficiency by working in memory to the greatest extent possible while also minimizing serialization and deserialization of data buffers between Dremio and client technologies like Python, R and Spark. Serialization is the process of translating data structures or objects into a format that can be stored or transmitted. Arrow is also designed to be used with graphics processing units and field-programmable gate array hardware accelerators, and it integrates with Python with “literally zero overhead,” Shiran said.
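To make the cost of serialization concrete, the sketch below contrasts two ways of handing a column of numbers to another component, using only the Python standard library. It is an illustration of the general idea, not Arrow's or Dremio's code: the `prices` column and the JSON round trip are assumptions chosen for the example.

```python
import json
from array import array

# A column of 64-bit integers held in one contiguous memory buffer.
prices = array("q", [120, 135, 150, 165])

# Path 1: serialize to text, then parse it back.
# Every value is re-encoded and copied into a new buffer.
wire = json.dumps(prices.tolist())
copied = array("q", json.loads(wire))

# Path 2: hand out a view over the same memory.
# No value is re-encoded or copied.
view = memoryview(prices)

# A write to the original buffer is visible through the view,
# but not in the deserialized copy.
prices[0] = 999
shared_first = view[0]
copied_first = copied[0]
```

After the write, `shared_first` reflects the new value while `copied_first` still holds the old one, which is the practical difference between sharing a buffer and shipping a serialized copy.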

Data analysis and visualization platforms like Tableau Software Inc.’s Tableau do a good job of enabling end-user reporting, but they don’t address the underlying data preparation processes, Shiran said. “The way data is managed hasn’t fundamentally changed in 30 years,” he said. Dremio is partnering with several BI firms, including Microsoft Corp., Tableau and Qlik Inc., to integrate their front-end tools with Dremio’s data management engine.

Instead of performing full table scans for all queries, Dremio pushes processing down into underlying data sources by rewriting SQL queries in the native query language of each source, such as Elasticsearch, MongoDB and HBase. The company has written connectors to popular relational database management engines, as well as to several non-relational sources, so that “all of your data in the company, no matter where it’s living, looks like it’s in one relational database, and a very fast one,” Shiran said. Dremio can perform joins across multiple data sources and is also optimized for file systems such as Amazon Web Services Inc.’s S3 and the Hadoop Distributed File System.
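The rewriting step can be pictured with a small sketch. The code below is not Dremio's implementation. It assumes a hypothetical, simplified predicate form of (column, operator, value) triples joined by AND, and the function names `to_mongo_filter` and `to_elasticsearch_query` are invented for the example. The output shapes follow the MongoDB filter document and the Elasticsearch query DSL.

```python
# Comparison operators mapped to each source's native spelling.
MONGO_OPS = {"=": "$eq", ">": "$gt", ">=": "$gte", "<": "$lt", "<=": "$lte"}
ES_RANGE_OPS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def to_mongo_filter(predicates):
    """Build a MongoDB filter document from AND-ed predicates."""
    query = {}
    for column, op, value in predicates:
        query.setdefault(column, {})[MONGO_OPS[op]] = value
    return query


def to_elasticsearch_query(predicates):
    """Build an Elasticsearch bool/filter query from AND-ed predicates."""
    clauses = []
    for column, op, value in predicates:
        if op == "=":
            clauses.append({"term": {column: value}})
        else:
            clauses.append({"range": {column: {ES_RANGE_OPS[op]: value}}})
    return {"query": {"bool": {"filter": clauses}}}


# Roughly: WHERE status = 'shipped' AND total > 100
where = [("status", "=", "shipped"), ("total", ">", 100)]
mongo_query = to_mongo_filter(where)
es_query = to_elasticsearch_query(where)
```

In both cases the filter runs inside the source system, so only matching rows travel back to the query engine. A real translator would also have to handle OR, functions, type differences and predicates the source cannot evaluate.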

A single view of data

Dremio applies machine learning to help users write better queries over time. An Excel-like user interface enables users to join tables across multiple back-end sources, both relational and non-relational. The system observes the queries users create and can recommend useful joins.
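The article does not say how Dremio's recommendations work. As a rough illustration of learning from observed queries, the sketch below swaps in plain frequency counting rather than machine learning: it suggests the tables most often queried together with a given table. The query log, table names and `recommend_joins` function are all hypothetical.

```python
from collections import Counter


def recommend_joins(query_log, table, top_n=3):
    """Suggest the tables most often used alongside `table`.

    `query_log` is a sequence of table-name collections, one per
    observed query (a hypothetical log format).
    """
    counts = Counter()
    for tables in query_log:
        if table in tables:
            # sorted() keeps the counting order deterministic.
            counts.update(t for t in sorted(set(tables)) if t != table)
    return [name for name, _ in counts.most_common(top_n)]


observed = [
    ("orders", "customers"),
    ("orders", "customers", "products"),
    ("orders", "products"),
    ("orders", "customers"),
    ("clicks", "sessions"),
]
suggestions = recommend_joins(observed, "orders")
```

Here `customers` appears with `orders` in three queries and `products` in two, so `customers` is suggested first.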

Shiran said the approach is similar to the one Google uses to deliver rapid response by organizing data infrastructures that are optimized for particular queries. “Users are playing with one global catalog of data, and behind the scenes the system is optimizing their queries,” he said.

Dremio’s query planner automatically selects the best way to handle each query at run time and optimizes data for specific query patterns, using representations that are columnar, compressed, aggregated, sorted, partitioned and co-located. The software also maintains multiple reflections of datasets in a user-readable format. Users have full visibility into how data is accessed, transformed, joined and shared, a feature that facilitates data governance and security.
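One way to picture a planner choosing among reflections is as a cost comparison. The sketch below is a simplified assumption, not Dremio's planner: each reflection is described only by the columns it holds and its row count, and the planner picks the smallest one that covers the columns a query needs, falling back to the raw dataset. The `Reflection` class, `choose_reflection` function and dataset names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    name: str
    columns: frozenset
    rows: int


def choose_reflection(needed_columns, reflections, raw):
    """Pick the cheapest dataset that contains every needed column.

    Row count stands in for cost. `raw` is assumed to hold all columns,
    so it is always a valid fallback.
    """
    needed = frozenset(needed_columns)
    candidates = [raw] + [r for r in reflections if needed <= r.columns]
    return min(candidates, key=lambda r: r.rows)


raw = Reflection(
    "raw_orders",
    frozenset({"region", "day", "total", "customer_id"}),
    1_000_000,
)
reflections = [
    Reflection("agg_by_region", frozenset({"region", "total"}), 50),
    Reflection("agg_by_region_day", frozenset({"region", "day", "total"}), 5_000),
]
choice = choose_reflection(["region", "total"], reflections, raw)
```

A query that needs only `region` and `total` is routed to the 50-row aggregate instead of the million-row source. A real planner would also have to check that an aggregated reflection matches the grouping the query asks for, which this sketch ignores.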

The open-source edition of Dremio is distributed as a freely downloadable community edition. A separate enterprise edition will be licensed as an annual subscription with support, a commercial license and as-yet-unspecified enterprise features. Pricing for the enterprise edition has not yet been set but will be based on the number of compute nodes supported.

Image: Dremio
