UPDATED 15:33 EDT / OCTOBER 30 2017

BIG DATA

Application container-friendly Pentaho 8 gets native Kafka support

The Hitachi Vantara subsidiary of Hitachi Ltd. has added support for Apache Kafka streaming data in version 8 of its Pentaho data integration and analytics software.

The move, announced on Oct. 26, extends the company’s embrace of the open-source ecosystem building around Apache Spark and its Spark Streaming extension, which are commonly used with Kafka.

Pentaho 8.0 fully enables streaming data ingestion and processing using either its native streaming engine or Kafka. The stream processing capability builds on Pentaho's existing Spark integration with SQL, MLlib and the "adaptive execution layer" the vendor introduced in the spring. Apache Kafka is a lightweight, fast and highly scalable message broker that passes data between applications and is commonly used in Hadoop big data environments. A recent Kafka enhancement added "exactly once" delivery capabilities.
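For illustration, sending a record with those exactly-once guarantees from Kafka's own Java client looks roughly like this. The broker address, transactional ID and topic name are placeholders, and this is generic Kafka client code rather than Pentaho's integration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; point this at your own cluster.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Setting a transactional ID enables the idempotent, exactly-once
        // producer introduced in Kafka 0.11.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "sensor-ingest-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            // "iot-events" is a hypothetical topic used for illustration.
            producer.send(new ProducerRecord<>("iot-events", "device-42", "{\"temp\": 21.5}"));
            producer.commitTransaction();
        }
    }
}
```

Because the transactional ID implicitly turns on idempotent writes, a retried send cannot produce a duplicate message on the topic.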

International Data Corp. estimates that the volume of data organizations produce will increase tenfold by 2025, that one-quarter of that data will be real-time and that the "internet of things" will account for 95 percent of that streaming volume, said Arik Pelkey, senior director of Pentaho product marketing at Hitachi Vantara. The company has revamped the architecture of its platform to accommodate other streaming engines and plans to include Apache Flink in the near future, he said.

The adaptive execution layer automatically maps data integration logic to the execution environment, reducing or eliminating the need for Spark programming. Users can match workloads to the most appropriate processing engine without the need to rewrite data integration logic. Adaptive execution has been made easier to set up, use and secure in the new release. As a result, Pelkey said, “you don’t have to be a developer anymore to work with Spark Streaming data.”
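To give a sense of what that abstraction saves, here is a minimal sketch of the hand-coded Spark Structured Streaming job a developer would otherwise write to consume a Kafka topic. The application name, broker address and topic are illustrative, and the snippet assumes the spark-sql-kafka connector is on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class ManualSparkStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("manual-kafka-stream") // illustrative app name
                .getOrCreate();

        // Subscribe to a Kafka topic; broker and topic are placeholders.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "iot-events")
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Write the stream to the console for demonstration purposes.
        StreamingQuery query = events.writeStream()
                .outputMode("append")
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```

The adaptive execution layer's pitch is that users design this logic once, visually, and the platform generates the engine-specific execution.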

To better address the growing popularity of containers, Hitachi Vantara is also adding support for "worker nodes," which are slimmed-down versions of its software optimized for speed and portability. "You can use worker nodes in the cloud or on-premises within a container to, for example, process multiple small jobs such as data transformation or reporting," said Anand Rao, a senior product marketing manager. "These virtual nodes form part of a cluster so you don't need the metadata repository to be replicated multiple times."

Worker nodes support the in-line visualization feature that the company also introduced this spring in an effort to make data integration simpler. The feature enables users to visualize data during the integration process to spot outliers more easily.

The new release also adds support for the Apache Knox Gateway to existing support for security protocols from Cloudera Inc. and Hortonworks Inc. The Knox gateway is used to authenticate users to Hadoop services. Also new is native support for the Apache Avro data serialization system and the Apache Parquet columnar storage format. Native support makes it easier for users to read and write those big data file formats and process them with Spark using Pentaho's visual editing tools. Availability is planned for next month.
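As a rough illustration of what the Avro format involves under the hood, the snippet below serializes a record with Avro's generic Java API. The schema and field names are hypothetical, and this is the stock Avro library rather than Pentaho's visual tooling:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    public static void main(String[] args) throws IOException {
        // A hypothetical schema for the kind of record a pipeline might carry.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Event\","
                + "\"fields\":[{\"name\":\"device\",\"type\":\"string\"},"
                + "{\"name\":\"temp\",\"type\":\"double\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord record = new GenericData.Record(schema);
        record.put("device", "device-42");
        record.put("temp", 21.5);

        // Serialize the record to Avro's compact binary encoding.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        System.out.println("Serialized " + out.size() + " bytes");
    }
}
```

Pentaho's pitch is that its visual tools hide this boilerplate while keeping the schema-aware binary format, which Spark can then read directly.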

