Yahoo helps TensorFlow access that juicy Apache Spark data


Yahoo Inc. announced today that it is open-sourcing the code for TensorFlowOnSpark, a software framework that combines the artificial intelligence brainpower of TensorFlow programs with the treasure trove of big data from the Apache Spark and Hadoop frameworks.

Deep learning is responsible for the creation of the most powerful AI programs around, from Google’s AlphaGo to Facebook’s Lumos. But it requires huge amounts of data to train new algorithms. Many companies use Spark to process massive datasets, and allowing deep learning frameworks to access that data would offer a huge boost to AI development.

In a blog post, Yahoo explained the challenges its programmers faced when trying to do just that. “Existing [deep learning] frameworks often require us to set up separate clusters for deep learning, forcing us to create multiple programs for a machine learning pipeline,” the Yahoo Big ML team explained. “Having separate clusters requires us to transfer large datasets between them, introducing unwanted system complexity and end-to-end learning latency.”

To solve this problem, Yahoo created CaffeOnSpark, which allows programs made with the Caffe machine learning framework to work with Apache Spark. Yahoo already uses programs made with CaffeOnSpark to identify inappropriate images in search, as well as to automatically detect esports game highlights from streaming video.

Yahoo open-sourced CaffeOnSpark last year, and now it has decided to do the same with TensorFlowOnSpark, which it built using the same principles as CaffeOnSpark but with Google’s popular open-source TensorFlow machine learning library as its base. This allows TensorFlow programs to access all of the data companies process on Apache Spark.

According to Yahoo, migrating TensorFlow programs to TensorFlowOnSpark is a relatively painless process, and the company has already been using the framework internally for some time. “Typically, changing fewer than 10 lines of Python code are needed,” Yahoo said. “Many developers at Yahoo who use TensorFlow have easily migrated TensorFlow programs for execution with TensorFlowOnSpark.”

Yahoo said it will continue to develop TensorFlowOnSpark and CaffeOnSpark, and it welcomes input from the open-source community on ways that each framework could be improved. You can read a more in-depth explanation of how TensorFlowOnSpark works in Yahoo’s full blog post.

Photo: The National Archives (UK) [CC BY 3.0], via Wikimedia Commons