

Researchers have found child sexual abuse material in LAION-5B, an open-source artificial intelligence training dataset used to build image generation models.
The discovery was made by the Stanford Internet Observatory, or SIO, which detailed its findings in a Tuesday report. SIO researchers have identified more than 1,000 exploitative images of children in LAION-5B. They noted in the report that they evaluated only a subset of the files in the database, which means it likely contains thousands of additional CSAM images that have not yet been found.
SIO identified the illegal images with a data management technique called hashing. Using the technique, researchers can turn a file into a unique series of letters and numbers called a hash. After creating hashes of the images in LAION-5B, the SIO researchers compared them against the hashes of known CSAM images.
“Removal of the identified source material is currently in progress as researchers reported the image URLs to the National Center for Missing and Exploited Children (NCMEC) in the U.S. and the Canadian Centre for Child Protection (C3P),” SIO researchers wrote. “The study was primarily conducted using hashing tools such as PhotoDNA, which match a fingerprint of an image to databases maintained by nonprofits that receive and process reports of online child sexual exploitation and abuse.”
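The general workflow is straightforward: fingerprint each file, then check the fingerprint against a list of known-bad hashes supplied by a clearinghouse. The minimal sketch below illustrates that pattern only. PhotoDNA, the tool named in the report, produces proprietary perceptual fingerprints that tolerate small image changes; this example substitutes an ordinary SHA-256 cryptographic hash, and the directory path and known-hash set are hypothetical placeholders.

```python
# Illustrative sketch of hash-based matching against a known-hash list.
# SHA-256 stands in for a perceptual fingerprint such as PhotoDNA, which is
# proprietary. The known_hashes set and file paths below are hypothetical.
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical hash list, as would be supplied by a reporting organization.
known_hashes = {
    "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
}

def scan(directory: Path) -> list[Path]:
    """Return the files whose hashes appear in the known-hash set."""
    return [
        p for p in directory.rglob("*")
        if p.is_file() and file_hash(p) in known_hashes
    ]

if __name__ == "__main__":
    matches = scan(Path("downloaded_images"))
    print(f"{len(matches)} files matched known hashes")
```

Because a cryptographic hash changes completely if even one byte of the file changes, real-world scanning relies on perceptual hashes that remain stable across resizing and recompression, which is the role PhotoDNA plays in the researchers' pipeline.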
LAION-5B was released in early 2022 by a German nonprofit that has received funding from several AI startups. The dataset comprises more than 5 billion images scraped from the web, along with accompanying captions. It’s an upgraded version of an earlier AI training dataset, LAION-400M, which the same nonprofit published a few months earlier and which includes about 400 million images.
In a statement issued to Bloomberg today, the nonprofit stated that it has a zero-tolerance policy for illegal content. It has deleted multiple versions of the dataset from the internet “to ensure they are safe before republishing them.” Additionally, the nonprofit has released filters for finding and removing illegal content from its datasets.
Since its release last year, LAION-5B has been used to train multiple image generation models. SIO determined that some of those models are used to generate CSAM images.
One of the highest-profile companies to have leveraged LAION-5B to train its neural networks is Stability AI Ltd., the startup behind the popular Stable Diffusion series of image generation models. The company told Bloomberg that the relatively recent 2.0 version of Stable Diffusion wasn’t trained on LAION-5B, but rather on a subset of the dataset with less unsafe content. Stability AI has also equipped its newer models with filters designed to block unsafe inputs and outputs.
The new SIO report doesn’t mark the first time the LAION-5B dataset has come under scrutiny. Early last year, three artists filed a lawsuit against Stability AI and two other companies that allegedly used millions of copyrighted images from LAION-5B to train their image generation models. Earlier, an artist’s private medical photos were discovered among the files in the database.