UPDATED 22:29 EDT / JUNE 28 2016

Don’t let your data lake turn into a data swamp | #HS16SJ

Data does not move easily. This truth has plagued the world of Big Data for some time and will continue to do so. In the end, the laws of physics dictate a speed limit, no matter what else is done. However, somewhere between data at rest and the speed of light, there are many processes that must be performed to make data mobile and useful. Integrating data and managing a data pipeline are two of these necessary tasks.

To shed some light on the world of data preparation, John Furrier (@furrier) and George Gilbert (@ggilbert41), cohosts of theCUBE, from the SiliconANGLE Media team, visited the Hadoop Summit US 2016 event in San Jose, California. There, they sat down with Chuck Yarbrough, senior director of Solutions Marketing and Management at Pentaho (A Hitachi Group Company).

Managing the data pipeline

The discussion started with a look at Pentaho and what it does. Yarbrough took the hosts through the company’s history, saying that early on the founders looked at what data analytics was all about and what it would become. Their idea was to do data integration, and do it right, in order to prepare data for the analytic process, with a vision of managing the entire data pipeline for analytic purposes.

Yarbrough then explained the solution, stating that Pentaho enables large-scale, complex use cases that require managing the entire pipeline. The data involved can be highly varied, coming in from many different sources. Blending and processing that varied data on the fly is the key, and that’s where Pentaho delivers value.

Keeping the data lake clean

Throwing a bunch of data into one place creates a data lake, but if that information isn’t managed, the lake becomes a swamp. Yarbrough asked: How does a company manage that data at scale? One load is simple, but 6,000 loads is something else. He described how Pentaho manages that data by leveraging metadata injection to make its processes dynamic.

“Manage what you’re doing,” he said.
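To make the idea concrete: instead of hand-building thousands of nearly identical load jobs, metadata injection keeps one generic template and supplies its specifics (source, target, field mappings) from a metadata catalog at runtime. In Pentaho Data Integration this is done through the ETL Metadata Injection step rather than written by hand; the Python sketch below only illustrates the general pattern, and the file names, tables and SQLite target are hypothetical rather than anything described in the interview.

```python
# Illustrative sketch of metadata-driven loading (not Pentaho's actual API):
# one generic routine whose behavior is "injected" from metadata records,
# so a single template can drive thousands of distinct loads.
import csv
import sqlite3
from dataclasses import dataclass


@dataclass
class LoadMetadata:
    source_path: str    # where the raw file lands in the data lake
    target_table: str   # where the refined data should go
    field_mapping: dict  # source column -> target column


def run_load(conn: sqlite3.Connection, meta: LoadMetadata) -> None:
    """Execute one load, configured entirely by the injected metadata."""
    columns = list(meta.field_mapping.values())
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {meta.target_table} "
        f"({', '.join(col + ' TEXT' for col in columns)})"
    )
    with open(meta.source_path, newline="") as f:
        for row in csv.DictReader(f):
            values = [row[src] for src in meta.field_mapping]
            placeholders = ", ".join("?" for _ in values)
            conn.execute(
                f"INSERT INTO {meta.target_table} ({', '.join(columns)}) "
                f"VALUES ({placeholders})",
                values,
            )
    conn.commit()


if __name__ == "__main__":
    # Create two tiny hypothetical source files so the sketch runs end to end.
    with open("sales_2016.csv", "w", newline="") as f:
        f.write("amt,dt\n19.99,2016-06-28\n")
    with open("clicks_2016.csv", "w", newline="") as f:
        f.write("url,ts\n/home,2016-06-28T22:29\n")

    conn = sqlite3.connect(":memory:")
    # In practice the metadata would come from a catalog describing every
    # feed in the lake; two hypothetical entries are shown here.
    loads = [
        LoadMetadata("sales_2016.csv", "sales", {"amt": "amount", "dt": "sale_date"}),
        LoadMetadata("clicks_2016.csv", "web_clicks", {"url": "page", "ts": "clicked_at"}),
    ]
    for meta in loads:
        run_load(conn, meta)  # the same template handles every load
```

The design point is that adding load number 6,001 means adding one metadata record, not building another pipeline, which is what keeps a lake of that size manageable rather than letting it turn into a swamp.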

Yarbrough then stressed that it always comes down to use cases, what the company is trying to do with its data. Customers want to take data from their lakes and reshape it into something different. The blueprint Pentaho produced does just that, simplifying the process and enabling data movement at scale.

Watch the entire video interview below, and be sure to check out more of SiliconANGLE and theCUBE’s coverage of the Hadoop Summit US.
