Researchers find child sexual abuse images in LAION-5B AI training dataset
Researchers have found child sexual abuse material, or CSAM, in LAION-5B, an open-source artificial intelligence training dataset used to build image generation models.
The discovery was made by the Stanford Internet Observatory, or SIO, which detailed its findings in a Tuesday report. SIO researchers have identified more than 1,000 exploitative images of children in LAION-5B. They noted that they evaluated only a subset of the files in the dataset, which means it likely contains thousands of additional CSAM images that have not yet been found.
SIO identified the illegal images with a data management technique called hashing. Using the technique, researchers can turn a file into a practically unique series of letters and numbers called a hash. After creating hashes of the images in LAION-5B, the SIO researchers compared them against the hashes of known CSAM images.
“Removal of the identified source material is currently in progress as researchers reported the image URLs to the National Center for Missing and Exploited Children (NCMEC) in the U.S. and the Canadian Centre for Child Protection (C3P),” SIO researchers wrote. “The study was primarily conducted using hashing tools such as PhotoDNA, which match a fingerprint of an image to databases maintained by nonprofits that receive and process reports of online child sexual exploitation and abuse.”
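The report doesn’t include the researchers’ tooling, but the general matching approach can be illustrated with a short sketch. The snippet below hashes a directory of files and checks each hash against a set of known values. The directory path and the entry in KNOWN_HASHES are hypothetical placeholders, and the sketch uses an ordinary cryptographic hash for simplicity, whereas PhotoDNA is a proprietary perceptual hash.

```python
import hashlib
from pathlib import Path

# Hypothetical set of known-bad hashes. In practice these fingerprints come
# from databases maintained by organizations such as NCMEC and C3P.
KNOWN_HASHES = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def sha256_of_file(path: Path) -> str:
    """Hash a file in chunks so large images don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_matches(image_dir: str) -> list[Path]:
    """Return every file whose hash appears in the known-hash set."""
    return [
        p for p in Path(image_dir).iterdir()
        if p.is_file() and sha256_of_file(p) in KNOWN_HASHES
    ]

if __name__ == "__main__":
    for match in find_matches("./downloaded_images"):  # hypothetical path
        print(f"Match found: {match}")
```

The key difference in practice is robustness: a cryptographic hash matches only byte-identical files, while perceptual hashes such as PhotoDNA are designed to also match images that have been resized or re-encoded, which is why they are the standard tool for this kind of scanning.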
LAION-5B was released in early 2022 by a German nonprofit that has received funding from several AI startups. The dataset comprises more than 5 billion links to images scraped from the web, along with their accompanying captions. It’s an upgraded version of an earlier AI training dataset, LAION-400M, which the same nonprofit published a few months earlier and which includes about 400 million images.
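Because the dataset is distributed as metadata rather than image files, working with it typically starts from parquet shards of URL-caption pairs. A minimal sketch, assuming a hypothetical local shard file and column names based on LAION’s public shards:

```python
import pandas as pd

# Hypothetical local shard; LAION publishes its metadata as parquet files
# whose rows pair an image URL with its scraped caption. The column names
# "URL" and "TEXT" are assumptions based on the publicly released shards.
df = pd.read_parquet("laion5b-shard-00000.parquet")

for url, caption in df[["URL", "TEXT"]].head(5).itertuples(index=False):
    print(f"{url} -> {caption!r}")
```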
In a statement issued to Bloomberg today, the nonprofit said it has a zero-tolerance policy for illegal content. It has taken multiple versions of the dataset offline “to ensure they are safe before republishing them.” Additionally, the nonprofit has released filters for finding and removing illegal content from its datasets.
Since its release last year, LAION-5B has been used to train multiple image generation models. SIO determined that some of those models are used to generate CSAM.
One of the highest-profile companies to have leveraged LAION-5B to train its neural networks is Stability AI Ltd., the startup behind the popular Stable Diffusion series of image generation models. The company told Bloomberg that the relatively recent 2.0 version of Stable Diffusion wasn’t trained on LAION-5B itself, but rather on a subset of the dataset with less unsafe content. Stability AI has also equipped its newer models with filters designed to block unsafe inputs and outputs.
The new SIO report doesn’t mark the first time the LAION-5B dataset has come under scrutiny. Early this year, three artists filed a lawsuit against Stability AI and two other companies that allegedly used millions of copyrighted images from LAION-5B to train their image generation models. Earlier, photos from an artist’s medical records were discovered among the files in the dataset.
Image: LAION