LinkedIn open sources TonY for running TensorFlow on Hadoop
LinkedIn Corp. is donating another internally created software project to the open-source community.
The company, now owned by Microsoft Corp., has a long history of open-source software contributions, including popular projects such as Apache Kafka and its more recent Dynamometer tool. Its latest effort, called “TensorFlow on YARN,” or “TonY” for short, is designed to help connect the open-source TensorFlow machine learning framework with data stored in Apache Hadoop.
TensorFlow is an open-source software library released in 2015 by Google LLC to make it easier for developers to design, build and train deep learning models. It’s one of the most popular machine learning frameworks because it can train and run deep neural networks, including recurrent networks and sequence-to-sequence models, for tasks such as handwritten digit classification, image recognition, word embeddings, machine translation and natural language processing.
Hadoop is a distributed processing software framework that manages data processing and storage for “big data” applications. It’s at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications.
LinkedIn software engineer Jonathan Hung said in a blog post that the company built TonY due to its increasing reliance on deep neural networks to power some of the features on its website, including news feeds and smart replies.
The problem LinkedIn faced is that many of these features are built using TensorFlow, which lacked a reliable way to connect to Hadoop clusters and use the data stored there to train its models.
“With hundreds of petabytes of data stored on our Hadoop clusters that could be leveraged for deep learning, we needed a scalable way to process all of this information,” Hung said.
TensorFlow already supported something called “distributed training,” a technique that’s useful for processing large datasets like those stored in Hadoop. But the main issue for LinkedIn was that this process needed to be orchestrated manually, which is “not a trivial task” and not something that most data scientists are qualified to do, Hung explained.
So Hung and his team set about creating TonY to automate this chore. The software works similarly to how MapReduce enables the running of Apache Pig or Apache Hive scripts on Hadoop, handling tasks such as “resource negotiation and container environment setup,” Hung said.
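To give a sense of the manual bookkeeping involved, the Python sketch below builds the per-task `TF_CONFIG` value that TensorFlow's distributed runtime can read to learn the cluster layout. The host names, ports and the helper `build_tf_configs` are hypothetical, and this illustrates the chore itself rather than how TonY is actually implemented.

```python
import json


def build_tf_configs(workers, parameter_servers):
    """Return one TF_CONFIG JSON string per task in the cluster.

    Every process needs the same cluster map plus its own role and index.
    (Hypothetical helper, for illustration only.)
    """
    cluster = {"worker": list(workers), "ps": list(parameter_servers)}
    configs = {}
    for task_type, hosts in cluster.items():
        for index, _host in enumerate(hosts):
            task = {"type": task_type, "index": index}
            configs[(task_type, index)] = json.dumps(
                {"cluster": cluster, "task": task}
            )
    return configs


# Hypothetical addresses; a real job would get these from the scheduler.
configs = build_tf_configs(
    workers=["host1.example.com:2222", "host2.example.com:2222"],
    parameter_servers=["host3.example.com:2222"],
)

for (task_type, index), value in sorted(configs.items()):
    print(f"{task_type}:{index} TF_CONFIG={value}")
```

Done by hand, each of these values has to be delivered to the right machine before training starts, and regenerated whenever a host or port changes. That is the kind of setup a resource manager can take over.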
TonY offers a number of features that help to enhance distributed training jobs for neural networks, including GPU scheduling for better management of resources; support for TensorBoard, which makes it easier to debug and optimize TensorFlow programs; and better fault tolerance that allows users to restore their training status from previously saved checkpoints in the event of any problems.
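The checkpoint-based recovery described above can be pictured with a small Python sketch. It assumes a hypothetical naming scheme in which each checkpoint is a file called `ckpt-<step>`. The functions `latest_checkpoint_step` and `resume_step` are illustrative names, not part of TonY or TensorFlow.

```python
import os
import re
import tempfile

# Hypothetical naming scheme: one file per checkpoint, named "ckpt-<step>".
CHECKPOINT_PATTERN = re.compile(r"^ckpt-(\d+)$")


def latest_checkpoint_step(checkpoint_dir):
    """Return the highest saved step in checkpoint_dir, or None if none exist."""
    steps = []
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            # Compare steps as integers so that 1000 sorts after 200.
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def resume_step(checkpoint_dir):
    """Pick the step a restarted job should continue from."""
    last = latest_checkpoint_step(checkpoint_dir)
    return 0 if last is None else last + 1


# Demo: simulate a job that saved three checkpoints before failing.
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (100, 200, 1000):
        open(os.path.join(demo_dir, f"ckpt-{step}"), "w").close()
    resumed_from = resume_step(demo_dir)
    print(resumed_from)

with tempfile.TemporaryDirectory() as empty_dir:
    fresh_start = resume_step(empty_dir)
```

A restarted job that finds no checkpoint simply begins at step 0, so the same launch path can serve both fresh runs and recoveries.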
Analyst Holger Mueller of Constellation Research Inc. said TonY is a showcase for open-source contribution because it solves the key problem of connecting TensorFlow to Hadoop while also demonstrating why other open-source projects don’t quite fit.
“LinkedIn gives a great use case for TonY, and with that the credibility that this is a working and supported open-source project,” Mueller said. “It’s important for CxOs looking to power next-generation applications with TensorFlow because the data is in Hadoop already. It combines digital exhaust in Hadoop with one of the most popular deep learning-enabled neural networks.”
The other consideration for CxOs is that TonY was developed by LinkedIn, now owned by Microsoft Corp., so they can be assured it will continue to be supported in the long run.
“This gives a lot of enterprises a magnitude of confidence that they’ll have some sort of leverage in case of critical developments,” Mueller added.
LinkedIn said it is open-sourcing TonY so that others interested in running distributed machine learning on Hadoop can use and contribute to the project. TonY is available to download from GitHub starting today.
Image: LinkedIn