UPDATED 00:23 EDT / JANUARY 15 2016


Yahoo has just made a massive 13.5TB machine learning dataset available

Yahoo has just handed over what it claims is the world’s largest-ever machine learning dataset to the academic research community through its ongoing program, Yahoo Labs Webscope. The company said it hopes the release will encourage more people, not just data scientists and researchers, to try their hand at machine learning.

“Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research,” said Suju Rajan, director of research for personalization science at Yahoo Labs, in the announcement.

The whopping 13.5TB dataset contains the anonymized data Yahoo has accumulated from the interactions of around 20 million of its users, from February 2015 to May 2015.

Yahoo Labs Webscope is a data-sharing project where Yahoo stores massive amounts of anonymized data. The company has now authorized its use for non-commercial purposes.

Much of the data concerns users’ interactions with the news feeds on Yahoo properties such as Yahoo News, Yahoo Sports, Yahoo Movies and the Yahoo home page. Yahoo is also providing anonymized demographic data, including the age, gender and location of a subset of those users. The dataset also includes timestamps and other information from the end user’s device, as well as the title, summary and key phrases of the articles users interacted with.

Yahoo’s donation comes as interest in machine learning rapidly gathers pace. A number of big Web companies, including Google and IBM, have recently open-sourced their machine learning algorithms to help researchers get closer to building machines and applications that can show true artificial intelligence.

“Machine learning is a core transformative way by which we are rethinking everything we are doing,” said Google CEO Sundar Pichai in October, shortly before the company open-sourced its TensorFlow machine learning software.

More recently, Microsoft got in the game, open-sourcing its DMTK machine learning toolkit. But Yahoo’s release is important in a different way, because it makes it possible for individuals and small organizations that lack extensive compute resources to begin using machine learning too.

“We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, ‘real-world’ dataset,” Rajan said.

Image credit: Foundry via pixabay.com
