UPDATED 13:19 EDT / OCTOBER 24 2012

Impala Expands Real-Time Query in Hadoop, Empowers SQL BI, Visualization Tools

Impala changes the equation for Hadoop, giving users answers to their queries in seconds rather than hours, says Cloudera CEO Mike Olson in the Cube at the Strata/HadoopWorld 2012 conference.

His chief scientist, Jeff Hammerbacher, agrees, saying that Impala is the first tool to change how he uses Hadoop. “I use it every day.”

Impala, which Cloudera announced at the start of the conference, is a distributed real-time query engine that works with HBase and HDFS. It accepts SQL queries, allowing traditional SQL query and visualization tools to run directly against data stored in Hadoop rather than requiring them to work through connections from traditional data warehouses.

“We’ve known for a long time that batch data processing solves only part of the Big Data problem,” Olson said. “Not every user and workload can tolerate the latency.”

Hadoop can handle any kind of data, structured as well as unstructured. Impala does not replace Hadoop's batch-processing tools for complex analysis. Rather, it allows business users as well as data scientists to run simpler, interactive queries and get answers “at the speed of thought.” It also allows business executives to use the SQL query tools they already know rather than having to learn MapReduce. Thus it augments rather than replaces the Hadoop batch query tools, whose strength is in handling more complex queries.

Nor does it necessarily mean that Hadoop should or will replace the data warehouses in every organization. Olson, who self-identifies as an “old guard relational developer from the RDBMS industry” going back to its beginnings in the 1980s, argues that RDBMS products are excellent for what they do. “If you are doing banking transactions or OLAP, you will continue to run on your RDBMS data warehouse.”

Nor does it invalidate Oracle’s strategy, enunciated by Oracle CEO Larry Ellison at Oracle OpenWorld 2012 recently, of scrubbing the data in Hadoop and then “blasting it into the Big Iron” of the Oracle DW, Olson argues. Oracle is a Cloudera partner, and Olson argues that the right solution depends on the needs of the particular user.

But Impala allows users to do things with Hadoop, using the expanded data types it supports, that they could not do before. While it is not as fast as a high-end RDBMS system, it is a much less expensive solution, which makes applications that do not need the very high performance of such systems more practical. Nor does it replace HBase. Rather, Olson says, it provides a real-time solution for a specific set of users with different needs from those who use HBase or who program very complex queries with MapReduce. Nor will it be the last such query engine. “In the next two years Hadoop will get more real-time workloads that will attack different programming paradigms,” he predicted. He sees several interesting development projects going on in academia, and he promised to add those that catch on in the user community to Cloudera’s platform.

Specifically, says Hammerbacher, “HBase is good if you can specify a row or column. Solr goes past that to allow analysis of free text across many columns or within a field. Impala is solving the problem of aggregating data across multiple tables.” Beyond that, a new generation of open source offerings is appearing, aimed at processing data streams before they even hit the storage system, which is “another interesting class of real-time analysis.”
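For readers who want a concrete picture of "aggregating data across multiple tables," the sketch below shows the general shape of such a query: a join between two tables followed by a grouped sum. It is only an illustration. It uses Python's built-in sqlite3 module as a local stand-in rather than Impala itself, and the table names, column names, and the `revenue_by_region` helper are all hypothetical.

```python
import sqlite3


def revenue_by_region(conn):
    """Join two tables and aggregate per region (hypothetical schema)."""
    query = """
        SELECT c.region, SUM(o.amount) AS total
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(query).fetchall()


# In-memory SQLite database used as a stand-in; this is NOT Impala.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 2, 5.0), (3, 3, 2.5), (4, 2, 1.0);
""")

result = revenue_by_region(conn)
print(result)
```

The point of the sketch is the query itself: this kind of join-and-aggregate statement is what SQL-based BI and visualization tools typically issue, which is why an engine that accepts SQL lets those tools be reused.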

The next step for improving Impala’s performance, he said, is developing sophisticated joining algorithms. Flash does not provide that much of an advantage, only about a 2X to 3X improvement, which, given the differential in cost between flash and disk, makes that an impractical solution.

However, the problems that most interest him are developing better tools for cleaning non-relational data in Hadoop and then developing technologies to support analysis models such as regression and decision trees. “That comes down to the optimization algorithms. I want to parallelize that across the cluster so you don’t have to leave the BI tool you already know to work with Hadoop.”

