UPDATED 18:33 EDT / JULY 24 2024

AI

Researchers find that AI-generated web content could make large language models less accurate

A newly published research paper suggests that the proliferation of algorithmically generated web content could make large language models less useful.

The paper appeared today in the scientific journal Nature. It’s based on a recently concluded research initiative led by Ilia Shumailov, a computer scientist at the University of Oxford. Shumailov carried out the project in partnership with colleagues from the University of Cambridge, the University of Toronto and other academic institutions.

AI models produce a growing share of the content available online. According to the researchers, the goal of their study was to evaluate what would happen in a hypothetical future where LLMs generate most of the text on the web. They determined that such a scenario would increase the likelihood of so-called model collapse, a situation in which newly trained AI models can no longer generate useful output.

The issue stems from the fact that developers typically train their LLMs on webpages. In a future where most of the web consists of AI-generated content, such content would account for the bulk of LLM training datasets. AI-generated data tends to be less accurate than information produced by humans, which means training LLMs on it can degrade the quality of those models’ output.
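To illustrate the dynamic at a very small scale, the sketch below (a toy example, not the authors’ experiment) repeatedly re-fits a simple one-dimensional Gaussian to samples drawn from the previous generation’s fit. Because each generation trains only on the previous model’s output, estimation errors compound and the learned distribution drifts away from the original human-generated data.

```python
# Toy illustration of recursive training on model-generated data (not the
# paper's setup): each "generation" is fit only to samples produced by the
# previous generation, so finite-sample errors accumulate over time.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(15):
    mu, sigma = data.mean(), data.std()           # fit the current "model"
    print(f"gen {generation:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")
    # The next generation sees only synthetic samples from this model.
    data = rng.normal(loc=mu, scale=sigma, size=100)
```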

The potential impact is not limited to LLMs. According to the paper’s authors, the issue also affects two other types of neural networks known as variational autoencoders and Gaussian mixture models.

Variational autoencoders, or VAEs, are used to turn raw AI training data into a form that lends itself better to building neural networks. VAEs can, for example, reduce the size of training datasets to lower storage infrastructure requirements. Gaussian mixture models, in turn, are used for tasks such as grouping documents by category.
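As a rough sketch of the kind of task a Gaussian mixture model handles, the snippet below clusters a handful of invented documents with scikit-learn; the documents, feature pipeline and cluster count are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of using a Gaussian mixture model to group documents by
# topic. The sample documents and pipeline are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

docs = [
    "stocks rallied as markets reacted to the earnings report",
    "the central bank raised interest rates again this quarter",
    "the team scored twice in the final minutes of the match",
    "the striker signed a new contract with the club",
]

# Turn the text into dense numeric features the mixture model can fit.
features = TfidfVectorizer().fit_transform(docs)
features = TruncatedSVD(n_components=2, random_state=0).fit_transform(features)

# Fit a two-component mixture and assign each document to a cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
print(gmm.predict(features))  # e.g. [0 0 1 1] -- finance vs. sports
```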

The researchers determined that the issue not only affects multiple types of AI models but is also “inevitable,” even when developers create “almost ideal conditions for long-term learning” as part of an AI development project.

At the same time, the researchers pointed out that there are ways to mitigate the negative impact of AI-generated training datasets on neural networks’ accuracy. They demonstrated one such method in a test that involved OPT-125m, an open-source language model released by Meta Platforms Inc. in 2022.

The researchers created several different versions of OPT-125m as part of the project. Some were trained entirely on AI-generated content, while others were developed with a dataset in which 10% of the information was generated by humans. The researchers determined that adding human-generated information significantly slowed the decline in the quality of OPT-125m’s output.
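A minimal sketch of how such a data mix might be assembled is shown below. The 10% human share follows the figure reported above, but the loader functions and sampling details are hypothetical placeholders, not the researchers’ actual pipeline.

```python
# Sketch of building the kind of mixed training set described above:
# roughly 10% human-written text blended into a mostly synthetic corpus.
# load_human_text() and load_synthetic_text() are hypothetical placeholders,
# not functions from the paper or from Meta's OPT codebase.
import random

def build_training_mix(human_docs, synthetic_docs,
                       human_fraction=0.10, size=10_000, seed=0):
    """Sample a training set in which human_fraction of examples are human-written."""
    rng = random.Random(seed)
    n_human = int(size * human_fraction)
    mix = (rng.choices(human_docs, k=n_human) +
           rng.choices(synthetic_docs, k=size - n_human))
    rng.shuffle(mix)
    return mix

# Usage (with the hypothetical loaders):
# corpus = build_training_mix(load_human_text(), load_synthetic_text())
```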

The paper draws the conclusion that steps will have to be taken to ensure high-quality content remains available for AI development projects. “To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remain available over time,” the researchers wrote. “Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that were crawled from the Internet before the mass adoption of the technology.”

Image: Unsplash
