UPDATED 09:00 EDT / MAY 12 2015

Pentaho harnesses Apache Spark for lightning-fast ETL

After a year of development, Pentaho Labs has finished adapting the data integration and preparation component of its widely used business intelligence suite to work with Apache Spark. The integration is the company’s first major update since Hitachi’s storage division acquired it earlier this year for a reported $500 million to $600 million.

The landmark deal bought the Japanese conglomerate a proven foundation upon which to build its vision of a unified analytics platform for processing the different kinds of transmissions coming off the so-called industrial internet, where rival industrial giant General Electric Co. is pursuing a similar effort. The addition of support for Spark is a natural continuation of that ambitious initiative.

The open-source Spark execution engine can operate up to 100 times faster than MapReduce, the default processing option for Hadoop, which makes it much better equipped to extract timely answers from the massive amounts of information that Hitachi’s customers handle. Moreover, such large workloads typically incorporate multiple types of data that each have to be analyzed in a specific way, which is another strong suit of Spark.

The project can support not only the conventional batch analytics that Hadoop was originally built for but also stream processing and machine learning, functionality that covers practically every major requirement of the typical data initiative. The ability to perform all of that computation in a single framework instead of using a separate technology for each is a major operational boon that makes it that much more feasible for organizations to tap into their vast information troves.

The updated Pentaho Data Integration (PDI) engine promises to help analysts make the most of that feature set. The newly added interoperability will automatically translate workflows created through the company’s drag-and-drop interface into Spark jobs, eliminating the need for manual implementation and further lowering the barrier to entry for the technology.

The integration marks a major milestone in Hitachi’s effort to catch up with GE, which started working on its own analytics platform several years earlier and has managed to open a sizable lead as a result. Pentaho’s Spark support will likely be extended to more of Spark’s features and additional use cases, as well as to other emerging components from the upstream Hadoop ecosystem, as the Japanese giant continues to try to push ahead in the race over the industrial internet.
Photo by Tom Bullock via Flickr

