UPDATED 11:00 EDT / SEPTEMBER 13 2018

BIG DATA

LinkedIn open sources TonY for running TensorFlow on Hadoop

LinkedIn Corp. is donating another internally created software project to the open-source community.

The now-Microsoft Corp.-owned company has a long history of open-source software contributions, including popular projects such as Apache Kafka and its more recent Dynamometer tool. Its latest effort, called “TensorFlow on YARN,” or “TonY” for short, is designed to help connect the open-source TensorFlow machine learning framework with data stored in Apache Hadoop.

TensorFlow is an open-source software library released in 2015 by Google LLC to make it easier for developers to design, build and train deep learning models. It’s one of the most popular frameworks for machine learning because it can train and run deep neural networks, including recurrent neural networks and sequence-to-sequence models, for tasks such as handwritten digit classification, image recognition, word embeddings, machine translation and natural language processing.

Hadoop is a distributed processing software framework that manages data processing and storage for “big data” applications. It’s at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications.

LinkedIn software engineer Jonathan Hung said in a blog post that the company built TonY due to its increasing reliance on deep neural networks to power some of the features on its website, including news feeds and smart replies.

The problem LinkedIn faced is that many of these features are built using TensorFlow, which lacked a reliable way to connect to Hadoop clusters and use the data stored there to train its models.

“With hundreds of petabytes of data stored on our Hadoop clusters that could be leveraged for deep learning, we needed a scalable way to process all of this information,” Hung said.

TensorFlow already supported “distributed training,” a technique that’s useful for processing large datasets like those stored in Hadoop. But the main issue for LinkedIn was that this process needed to be orchestrated manually, which is “not a trivial task” and not something that most data scientists are qualified to do, Hung explained.

So Hung and his team set about creating TonY to automate this chore. The software works similarly to how MapReduce enables the running of Apache Pig or Apache Hive scripts on Hadoop, handling tasks such as “resource negotiation and container environment setup,” Hung said.
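To give a sense of the manual setup involved, the sketch below builds the per-task cluster description that some of TensorFlow's distributed training APIs read from the `TF_CONFIG` environment variable. It is an illustration only, not TonY's implementation: the helper name `build_tf_configs` and the host names are hypothetical, and a real job would also need to launch each process on the right machine with the right resources.

```python
import json


def build_tf_configs(worker_hosts, ps_hosts):
    """Return a TF_CONFIG JSON string for every task in the cluster.

    Hypothetical helper for illustration; each task gets the same cluster
    map plus its own role ("type") and position ("index") within that role.
    """
    cluster = {"worker": list(worker_hosts), "ps": list(ps_hosts)}
    configs = {}
    for task_type, hosts in cluster.items():
        for index in range(len(hosts)):
            configs[(task_type, index)] = json.dumps(
                {"cluster": cluster, "task": {"type": task_type, "index": index}},
                sort_keys=True,
            )
    return configs


# Hypothetical host:port pairs; without a scheduler, someone has to pick
# these by hand and export the matching TF_CONFIG on each machine.
configs = build_tf_configs(
    worker_hosts=["host1.example.com:2222", "host2.example.com:2222"],
    ps_hosts=["host3.example.com:2222"],
)

for (task_type, index), tf_config in sorted(configs.items()):
    print(f"{task_type}:{index} TF_CONFIG={tf_config}")
```

Every task needs a consistent view of the whole cluster, so any change in hosts or ports means regenerating and redistributing all of these values, which is the kind of bookkeeping a resource manager can take over.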


TonY offers a number of features that help to enhance distributed training jobs for neural networks, including GPU scheduling for better management of resources; support for TensorBoard, which makes it easier to debug and optimize TensorFlow programs; and better fault tolerance that allows users to restore their training status from previously saved checkpoints in the event of any problems.
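The checkpoint-based recovery described above can be pictured with a small, framework-free sketch. It assumes a made-up training loop and JSON checkpoint files; the function names, file naming scheme and simulated failure are all hypothetical and stand in for whatever TensorFlow and TonY actually do.

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Return the highest saved step number, or None if nothing was saved."""
    steps = [
        int(name[len("step-"):-len(".json")])
        for name in os.listdir(ckpt_dir)
        if name.startswith("step-") and name.endswith(".json")
    ]
    return max(steps) if steps else None


def train(ckpt_dir, total_steps, save_every, fail_at=None):
    """Toy training loop that resumes from the newest checkpoint, if any."""
    start = latest_checkpoint(ckpt_dir)
    if start is None:
        state = {"step": 0, "loss": 1.0}
    else:
        with open(os.path.join(ckpt_dir, f"step-{start}.json")) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated worker failure")
        state["step"] += 1
        state["loss"] *= 0.9  # stand-in for a real optimization step
        if state["step"] % save_every == 0:
            path = os.path.join(ckpt_dir, f"step-{state['step']}.json")
            with open(path, "w") as f:
                json.dump(state, f)
    return state


with tempfile.TemporaryDirectory() as ckpt_dir:
    try:
        # First attempt dies after step 7; only the step-5 checkpoint exists.
        train(ckpt_dir, total_steps=10, save_every=5, fail_at=7)
    except RuntimeError:
        pass
    resumed_from = latest_checkpoint(ckpt_dir)
    # Second attempt picks up from the saved state instead of step 0.
    resumed = train(ckpt_dir, total_steps=10, save_every=5)

print(f"resumed from step {resumed_from}, finished at step {resumed['step']}")
```

The work done between the last checkpoint and the failure is repeated, so the save interval is a trade-off between checkpoint overhead and how much progress can be lost.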

Analyst Holger Mueller of Constellation Research Inc. said TonY is a showcase for open-source contribution because it solves the key problem of connecting TensorFlow to Hadoop while also demonstrating why other open-source projects don’t quite fit.

“LinkedIn gives a great use case for TonY, and with that the credibility that this is a working and supported open-source project,” Mueller said. “It’s important for CxOs looking to power next-generation applications with TensorFlow because the data is in Hadoop already. It combines digital exhaust in Hadoop with one of the most popular deep learning-enabled neural networks.”

The other consideration for CxOs is that TonY was developed by LinkedIn, now owned by Microsoft Corp., so they can be assured it will continue to be supported in the long run.

“This gives a lot of enterprises a magnitude of confidence that they’ll have some sort of leverage in case of critical developments,” Mueller added.

LinkedIn said it is open-sourcing TonY so that others interested in running distributed machine learning on Hadoop can use and contribute to the project. TonY is available to download from GitHub starting today.

Image: LinkedIn
