Researchers find that AI-generated web content could make large language models less accurate
A newly published research paper suggests that the proliferation of algorithmically generated web content could make large language models less useful.
The paper appeared today in the scientific journal Nature. It’s based on a recently concluded research initiative led by Ilia Shumailov, a computer scientist at the University of Oxford. Shumailov carried out the project in partnership with colleagues from the University of Cambridge, the University of Toronto and other academic institutions.
AI models produce a growing portion of the content available online. According to the researchers, the goal of their study was to evaluate what would happen in a hypothetical future where LLMs generate most of the text on the web. They determined that such a scenario would increase the likelihood of so-called model collapse, a degenerative process in which models trained largely on the output of earlier models gradually lose the ability to generate useful output.
The issue stems from the fact that developers typically train their LLMs on webpages. In a future where most of the web comprises AI-generated content, such content would account for the bulk of LLM training datasets. AI-generated data tends to be less accurate than information produced by humans, which means using it to build LLMs can degrade the quality of those models’ output.
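The dynamic is easiest to see in a toy setting. The following sketch is an illustration rather than code from the paper: a “model” that simply fits a normal distribution is retrained, generation after generation, on samples drawn from its predecessor, and its estimate of the data’s spread tends to erode as estimation errors compound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original "human" data: draws from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(201):
    # "Train" the model: estimate the distribution from the current data.
    mu, sigma = data.mean(), data.std()
    if generation % 40 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}  std={sigma:.3f}")
    # The next generation is trained only on the previous model's output,
    # so estimation errors compound and the fitted spread tends to shrink,
    # with the rare "tails" of the original data disappearing first.
    data = rng.normal(loc=mu, scale=sigma, size=50)
```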
The potential impact is not limited to LLMs. According to the paper’s authors, the issue also affects two other types of machine learning models: variational autoencoders and Gaussian mixture models.
Variational autoencoders, or VAEs, are neural networks that turn raw AI training data into a more compact form that lends itself better to building other models. They can, for example, reduce the size of training datasets to lower storage infrastructure requirements. Gaussian mixture models, the other model type affected, are statistical models used for tasks such as grouping documents by category.
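For intuition on how the same feedback loop degrades a Gaussian mixture model, here is another illustrative sketch, again not drawn from the paper and using scikit-learn’s GaussianMixture purely as a stand-in: a two-cluster mixture, representing grouped documents, is refit repeatedly on its own samples, and the original cluster structure gradually washes out.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two well-separated "topic" clusters, standing in for document embeddings.
data = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[+3.0, 0.0], scale=1.0, size=(100, 2)),
])

for generation in range(151):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    if generation % 50 == 0:
        # Average within-cluster spread of each fitted component.
        spread = np.sqrt(gmm.covariances_.trace(axis1=1, axis2=2) / 2)
        print(f"gen {generation:3d}  weights={gmm.weights_.round(2)}  "
              f"spread={spread.round(2)}")
    # Each new generation of "documents" is sampled from the fitted
    # mixture rather than from the original corpus.
    data, _ = gmm.sample(200)
```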
The researchers determined that the issue not only affects multiple types of AI models but is also “inevitable.” They found that this holds even when developers create “almost ideal conditions for long-term learning” as part of an AI development project.
At the same time, the researchers pointed out that there are ways to mitigate the negative impact of AI-generated training datasets on neural networks’ accuracy. They demonstrated one such method in a test that involved OPT-125m, an open-source language model released by Meta Platforms Inc. in 2022.
The researchers created several different versions of OPT-125m as part of the project. Some were trained entirely on AI-generated content, while others were developed with a dataset in which 10% of the information was generated by humans. The researchers determined that adding human-generated information significantly reduced the extent to which the quality of OPT-125m’s output declined.
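As a rough illustration of why mixing in human data helps (again a toy sketch, not the OPT-125m experiment itself), the earlier simulation can be extended so that a fixed share of each generation’s training set is drawn from the original human-written data; keeping even 10%, the figure cited above, tends to damp the drift.

```python
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(0.0, 1.0, size=50)  # the original human-written "corpus"

def run(generations, human_fraction, n=50):
    """Recursively refit a normal distribution, mixing a fixed share of
    the original human data into every generation's training set."""
    data = real_data
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()
        n_human = int(n * human_fraction)
        synthetic = rng.normal(mu, sigma, size=n - n_human)
        human = rng.choice(real_data, size=n_human, replace=True)
        data = np.concatenate([synthetic, human])
    return data.std()

print("final spread, all synthetic :", round(run(200, human_fraction=0.0), 3))
print("final spread, 10% human data:", round(run(200, human_fraction=0.1), 3))
```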
The paper draws the conclusion that steps will have to be taken to ensure high-quality content remains available for AI development projects. “To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remain available over time,” the researchers wrote. “Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that were crawled from the Internet before the mass adoption of the technology.”