

LinkedIn Corp. Thursday donated yet another internally built tool to the open-source community: a conversion tool that transforms data from Apache Spark into a format that can easily be consumed by TensorFlow for machine learning purposes.
TensorFlow is one of the most popular and widely used frameworks for running machine learning, deep learning and other statistical and predictive analytics workloads. Apache Spark is an open source big-data processing engine that’s designed to execute streaming, machine learning or SQL workloads that require fast and constant access to datasets.
LinkedIn’s new tool, called Avro2TF, enables data scientists and other users to convert datasets stored in the Apache Avro format commonly used by LinkedIn’s engineers into a format that TensorFlow can easily consume. The benefit is simple but useful: It frees engineers and developers to focus on their machine learning models rather than on data conversion.
Avro2TF is just the latest in a series of machine learning tools LinkedIn has donated to the open-source community, in line with its stated mission to “democratize machine learning.”
“One of the important lessons we have learned from this journey is the importance of providing good deep learning platforms that help our modeling engineers become more efficient and productive,” LinkedIn engineers Xuhong Zhang, Chenya Zhang and Yiming Ma wrote in a blog post. “Avro2TF is part of this effort to reduce the complexity of data processing and improve the velocity of advanced modeling.”
LinkedIn’s engineers explained that they built Avro2TF to address their need for a solution focused on “scalable data conversion.” The tool is said to support all kinds of Spark-readable data formats, including optimized row columnar, sparse vector and dense vector data.
[Figure: where Avro2TF fits into the TensorFlow stack]
LinkedIn said it believes that many organizations will be able to benefit from Avro2TF because the Microsoft Corp.-owned company isn’t the only one grappling with the challenge of converting data for machine learning purposes.
“We believe that this is not only a LinkedIn problem, many companies have vast amount of ML data in similar sparse vector format, and Tensor format is still relatively new to many companies,” the engineers said. “Avro2TF bridges this gap by providing scalable Spark based transformation and extensions mechanism to efficiently convert the data into TF records that can be readily consumed by TensorFlow.”
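To make the gap the engineers describe more concrete, here is a minimal Python sketch of the kind of reshaping involved: expanding a sparse vector, stored as parallel lists of indices and values, into the dense layout a tensor expects. It is an illustration only and does not use Avro2TF or its API. The function name `sparse_to_dense`, the record layout and the sample values are all assumptions made for this example.

```python
from typing import List


def sparse_to_dense(indices: List[int], values: List[float], size: int) -> List[float]:
    """Expand a sparse vector (parallel index/value lists) into a dense list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size  # every position not listed in `indices` stays zero
    for i, v in zip(indices, values):
        if not 0 <= i < size:
            raise IndexError(f"index {i} out of range for size {size}")
        dense[i] = v
    return dense


# Hypothetical record: a feature vector of length 6 with two non-zero entries.
record = {"indices": [1, 4], "values": [0.5, 2.0]}
dense = sparse_to_dense(record["indices"], record["values"], size=6)
print(dense)  # [0.0, 0.5, 0.0, 0.0, 2.0, 0.0]
```

Doing this one record at a time is easy. The harder problem the engineers point to is doing it across very large datasets, which is why the tool runs the conversion on Spark.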
Analyst Holger Mueller of Constellation Research Inc. told SiliconANGLE there should be many organizations that are eager to use Avro2TF, since it provides a vital link between two popular open-source technologies.
“These ‘bridge’ open-source projects are vital for enterprises to build next-generation apps because they don’t have the resources that LinkedIn has to build them,” Mueller said.
LinkedIn said Avro2TF is available to download on GitHub along with a tutorial on how to get it up and running.