UPDATED 15:52 EDT / JUNE 30 2016


3 ways Yahoo employed Hadoop to optimize utilization | #HS16SJ

In the last three years, the demands from customers have grown exponentially. Like many companies, Yahoo, Inc. is adapting to better serve its customers and provide a better user experience. In his talk as keynote speaker today during Hadoop Summit 2016 in San Jose, CA, Mark Holderbaugh, senior director of Hadoop Engineering at Yahoo, discussed the three major highlights of what Yahoo has done to meet these growing demands.

1. YARN

The first thing Yahoo did was look at YARN, a cluster management technology, as a tool to increase utilization. Because of the way the YARN scheduler worked, the company was getting only 40 percent utilization, so it needed to find a way to increase that figure. Based on feedback from the nodes, the team was able to adjust and work on raising that percentage to a more favorable number.

YARN turned out to be a worthy investment that returned better utilization.
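To make the 40 percent figure concrete, here is a minimal sketch of how a cluster-wide utilization number could be derived from per-node reports. It does not use any YARN API. The function name, the report fields, and the numbers are all hypothetical, and measuring utilization by memory is an assumption, since the talk did not specify the metric.

```python
# Hypothetical per-node reports; the field names and values are invented
# for illustration and chosen so the result matches the 40 percent figure.
reports = [
    {"node": "node-1", "total_mb": 64000, "used_mb": 32000},
    {"node": "node-2", "total_mb": 64000, "used_mb": 16000},
    {"node": "node-3", "total_mb": 32000, "used_mb": 16000},
]


def cluster_utilization(nodes):
    """Return the fraction of total memory in use across all nodes."""
    total = sum(n["total_mb"] for n in nodes)
    used = sum(n["used_mb"] for n in nodes)
    # Guard against an empty cluster to avoid dividing by zero.
    return used / total if total else 0.0


utilization = cluster_utilization(reports)
print(f"Cluster utilization: {utilization:.0%}")
```

Summing across nodes before dividing weights each node by its capacity, so a small, busy node does not skew the cluster-wide figure.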

2. Migration to Tez

The second action Yahoo took was to migrate to Apache Tez, an application framework that allows a complex directed acyclic graph (DAG) of tasks to be built for processing data.
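As a rough illustration of what a directed acyclic graph of tasks means, the sketch below models a small pipeline with Python's standard `graphlib` module. This is not Tez code and does not reflect Yahoo's jobs. The task names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "load_clicks": set(),
    "load_users": set(),
    "join": {"load_clicks", "load_users"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}


def execution_order(dag):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

The two load tasks have no dependency on each other, so a DAG-based engine is free to run them in parallel. Only the join has to wait for both.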

“Tez is the key that gives us up to zero minimal changes to jobs,” said Holderbaugh. This allowed the company to run millions of jobs and raise utilization 50 percent, just by switching to Tez. However, it was not a matter of flipping a switch on each cluster, Holderbaugh emphasized. Each job has different specifications and changes every day, so the migration has to be done on a case-by-case basis. The switch reduced runtime hours and memory use. Yahoo saw a 30 percent gain just from switching one pipeline.

It also started migrating Apache Hive jobs to Tez. “Hive gave us better utilization and improved latencies, allowing us to do more demand-type latent jobs,” said Holderbaugh. It shows which nodes the latencies occur on and makes it easier to remove those latencies from jobs. This avoids the need for extra clusters, and ultimately saves money.

3. Apache Storm

The third area Yahoo focused on was Apache Storm utilization. According to Holderbaugh, Yahoo has embraced Storm (an open-source distributed real-time computation system) since its creation in 2012. Storm is now used in every part of Yahoo, for data analysis as well as for monitoring clusters, and its utilization is even lower than in Hadoop clusters. Yahoo’s goal here was to keep things fine-grained while improving utilization.

Holderbaugh also emphasized that these topics had already been the subject of many panels during the Hadoop Summit, with even more planned for today. The schedule is rife with opportunities to learn more about Yahoo’s optimization efforts and much more. He encouraged the audience to go back and watch those talks or attend what they could for inspiration.

Hadoop in the cloud

After Holderbaugh finished his talk, Sanjay Radia, founder and architect at Hortonworks, Inc., took the stage. Radia’s talk focused on why you would want to put Hadoop in the cloud. First of all, it’s not actually a new idea, he said. Companies have been doing it for years. One major reason is the time and money you can save. There are no hardware costs involved, you don’t need an expert on staff and the cloud offers more elasticity. A cluster can even be created in minutes. In addition, the cloud takes away some of the complexity by offering pre-tuned clusters.

Radia also emphasized that having shared data “fundamentally means we need shared management.” He talked about how important it is to have shared metadata so that the data isn’t replicated, taking up needless space and wasting money. You can accomplish this by having a shared database server, Radia said.

Collaboration and the cloud are certainly themes at this year’s Hadoop Summit, as is using emerging technology to help your company run more smoothly and cost-effectively than ever before.

