UPDATED 00:23 EDT / JANUARY 15 2016

NEWS

Yahoo has just made a massive 13.5TB machine learning dataset available

Yahoo has just handed over what it claims is the world’s largest-ever machine learning dataset to the academic research community through its ongoing program, Yahoo Labs Webscope. The company said it’s hopeful that the release will encourage more people – not just data scientists and researchers – to try their hand at machine learning.

“Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research,” said Suju Rajan, director of research for personalization science at Yahoo Labs, in the announcement.

The whopping 13.5TB dataset contains the anonymized data Yahoo has accumulated from the interactions of around 20 million of its users, from February 2015 to May 2015.

Yahoo Labs Webscope is a data-sharing project where Yahoo stores massive amounts of anonymized data. The company has now authorized its use for non-commercial purposes.

Much of the data has to do with Yahoo users’ interactions with the news feeds on Yahoo properties like Yahoo News, Yahoo Sports, Yahoo Movies and its home page. Yahoo is also providing anonymized demographic data, such as the age, gender and location of a subset of its anonymized users. The data also includes timestamps and other data from the end user’s device, as well as the title, summary and key phrases from the articles users have interacted with.

Yahoo’s donation comes as interest in machine learning rapidly gathers pace. A number of big Web companies, including Google and IBM, have recently open-sourced their machine learning algorithms to help researchers get closer to building machines and applications that can show true artificial intelligence.

“Machine learning is a core transformative way by which we are rethinking everything we are doing,” said Google CEO Sundar Pichai in October, shortly before the company open-sourced its TensorFlow machine learning software.

More recently, Microsoft got in on the game, open-sourcing its DMTK machine learning toolkit. But Yahoo’s release is important in a different way, because it makes it possible for individuals and small organizations that don’t have the compute resources to begin using machine learning too.

“We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, ‘real-world’ dataset,” Rajan said.

