UPDATED 09:00 EST / MAY 12 2015

Pentaho harnesses Apache Spark for lightning-fast ETL

After a year of development, Pentaho Labs has finished adapting the data integration and preparation component of its widely used business intelligence suite to work with Apache Spark. The integration is the company’s first major update since it was acquired by Hitachi’s storage division earlier this year for a reported $500 million to $600 million.

The landmark deal bought the Japanese conglomerate a proven foundation upon which to build its vision of a unified analytics platform for processing the different kinds of transmissions coming off the so-called industrial internet, where rival industrial giant General Electric Co. is pursuing a similar effort. The addition of support for Spark is a natural continuation of that ambitious initiative.

The open-source Spark execution engine can operate up to 100 times faster than MapReduce, the default processing option for Hadoop, which makes it much better equipped to extract timely answers from the massive amounts of information that Hitachi’s customers handle. Moreover, such large workloads typically incorporate multiple types of data that each have to be analyzed in a specific way, which is another strong suit of Spark.

The project can support not only the conventional batch analytics that Hadoop was originally built for but also stream processing and machine learning, functionality that covers practically every major requirement of the typical data initiative. The ability to perform all of that computation in a single framework instead of using a separate technology for each is a major operational boon that makes it that much more feasible for organizations to tap into their vast information troves.

The updated Pentaho Data Integration (PDI) engine promises to help analysts make the most of that feature set. The newly added interoperability will automatically translate workflows created through the company’s drag-and-drop interface into Spark jobs, thereby eliminating the need for manual implementation and further lowering the barrier to entry for taking advantage of the technology.

The integration marks a major milestone in Hitachi’s effort to catch up with GE, which started working on its own analytics platform several years earlier and has opened a sizable lead as a result. Support will likely be extended to more of Spark’s features and additional use cases, as well as to other emerging components of the upstream Hadoop ecosystem, as the Japanese giant continues to try to push ahead in the race over the industrial internet.
Photo by Tom Bullock via Flickr

