UPDATED 09:00 EDT / MAY 12 2015

NEWS

Pentaho harnesses Apache Spark for lightning-fast ETL

After a year of development, Pentaho Labs has finished adapting the data integration and preparation component of its widely used business intelligence suite to work with Apache Spark. The integration is Pentaho’s first major update since the company was acquired by Hitachi’s storage division earlier this year for a reported $500 million to $600 million.

The landmark deal bought the Japanese conglomerate a proven foundation upon which to build its vision of a unified analytics platform for processing the different kinds of transmissions coming off the so-called industrial internet, where rival industrial giant General Electric Co. is pursuing a similar effort. The addition of support for Spark is a natural continuation of that ambitious initiative.

The open-source Spark execution engine can run up to 100 times faster than MapReduce, the default processing option for Hadoop, which makes it much better equipped to extract timely answers from the massive amounts of information that Hitachi’s customers handle. Moreover, such large workloads typically incorporate multiple types of data that each have to be analyzed in a specific way, which is another strong suit of Spark.

Spark supports not only the conventional batch analytics that Hadoop was originally built for but also stream processing and machine learning, functionality that covers practically every major requirement of the typical data initiative. The ability to perform all of that computation in a single framework instead of using a separate technology for each is a major operational boon that makes it more feasible for organizations to tap into their vast information troves.

The updated Pentaho Data Integration (PDI) engine promises to help analysts make the most of that feature set. The newly added integration automatically translates workflows created through the company’s drag-and-drop interface into Spark jobs, eliminating the need for manual implementation and further lowering the barrier to entry for the technology.

The integration marks a major milestone in Hitachi’s effort to catch up with GE, which started working on its own analytics platform several years earlier and has opened a sizable lead as a result. Pentaho’s Spark support will likely be extended to more of Spark’s features and additional use cases, as well as to other emerging components from the upstream Hadoop ecosystem, as the Japanese giant continues to try to push ahead in the race over the industrial internet.
Photo by Tom Bullock via Flickr
