UPDATED 22:29 EDT / JUNE 28 2016

Don’t let your data lake turn into a data swamp | #HS16SJ

Data does not move easily. This truth has plagued the world of Big Data for some time and will continue to do so. In the end, the laws of physics dictate a speed limit, no matter what else is done. However, somewhere between data at rest and the speed of light, there are many processes that must be performed to make data mobile and useful. Integrating data and managing a data pipeline are two of these necessary tasks.

To shed some light on the world of data preparation, John Furrier (@furrier) and George Gilbert (@ggilbert41), cohosts of theCUBE, from the SiliconANGLE Media team, visited the Hadoop Summit US 2016 event in San Jose, California. There, they sat down with Chuck Yarbrough, senior director of Solutions Marketing and Management at Pentaho (A Hitachi Group Company).

Managing the data pipeline

The discussion started with a look at Pentaho and what it does. Yarbrough took the hosts through the company’s history, saying that early on the founders looked at what data analytics was all about and what it would become. Their idea was to do data integration, and do it right, in order to prepare data for the analytic process, with a vision of managing the entire data pipeline for analytic purposes.

Yarbrough then explained the solution, stating that Pentaho enables large-scale, complex use cases that require managing the entire pipeline. The data involved can be highly varied, coming in from many different sources. Blending and processing that varied data on the fly is the key, and that’s where Pentaho delivers value.

Keeping the data lake clean

Throwing a bunch of data into one place creates a data lake, but if that information isn’t managed, the lake becomes a swamp. Yarbrough asked: How does a company manage that data at scale? One load is simple, but 6,000 loads is something else. He described how Pentaho manages that data by leveraging metadata injection to make its processes dynamic.

“Manage what you’re doing,” he said.
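To make the idea concrete: instead of hand-building thousands of nearly identical load jobs, metadata injection keeps one generic template and supplies its specifics (source, target, field mappings) from a metadata catalog at runtime. In Pentaho Data Integration this is done through the ETL Metadata Injection step rather than written by hand; the Python sketch below only illustrates the general pattern, and the file names, tables and SQLite target are hypothetical rather than anything described in the interview.

```python
# Illustrative sketch of metadata-driven loading (not Pentaho's actual API):
# one generic routine whose behavior is "injected" from metadata records,
# so a single template can drive thousands of distinct loads.
import csv
import sqlite3
from dataclasses import dataclass


@dataclass
class LoadMetadata:
    source_path: str    # where the raw file lands in the data lake
    target_table: str   # where the refined data should go
    field_mapping: dict  # source column -> target column


def run_load(conn: sqlite3.Connection, meta: LoadMetadata) -> None:
    """Execute one load, configured entirely by the injected metadata."""
    columns = list(meta.field_mapping.values())
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {meta.target_table} "
        f"({', '.join(col + ' TEXT' for col in columns)})"
    )
    with open(meta.source_path, newline="") as f:
        for row in csv.DictReader(f):
            values = [row[src] for src in meta.field_mapping]
            placeholders = ", ".join("?" for _ in values)
            conn.execute(
                f"INSERT INTO {meta.target_table} ({', '.join(columns)}) "
                f"VALUES ({placeholders})",
                values,
            )
    conn.commit()


if __name__ == "__main__":
    # Create two tiny hypothetical source files so the sketch runs end to end.
    with open("sales_2016.csv", "w", newline="") as f:
        f.write("amt,dt\n19.99,2016-06-28\n")
    with open("clicks_2016.csv", "w", newline="") as f:
        f.write("url,ts\n/home,2016-06-28T22:29\n")

    conn = sqlite3.connect(":memory:")
    # In practice the metadata would come from a catalog describing every
    # feed in the lake; two hypothetical entries are shown here.
    loads = [
        LoadMetadata("sales_2016.csv", "sales", {"amt": "amount", "dt": "sale_date"}),
        LoadMetadata("clicks_2016.csv", "web_clicks", {"url": "page", "ts": "clicked_at"}),
    ]
    for meta in loads:
        run_load(conn, meta)  # the same template handles every load
```

The design point is that adding load number 6,001 means adding one metadata record, not building another pipeline, which is what keeps a lake of that size manageable rather than letting it turn into a swamp.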

Yarbrough then stressed that it always comes down to use cases, what the company is trying to do with its data. Customers want to take data from their lakes and reshape it into something different. The blueprint Pentaho produced does just that, simplifying the process and enabling data movement at scale.

Watch the entire video interview below, and be sure to check out more of SiliconANGLE and theCUBE’s coverage of the Hadoop Summit US.
