UPDATED 12:50 EST / JULY 10 2023


Comedian Sarah Silverman sues OpenAI and Meta over copyright infringement

Comedian and author Sarah Silverman and two other authors are suing OpenAI LP, the developer of ChatGPT, and Mark Zuckerberg’s Meta Platforms Inc., claiming that the companies used copyrighted material from their books to train their artificial intelligence chatbots.

Silverman and authors Christopher Golden and Richard Kadrey have filed a pair of class action lawsuits alleging that the two companies remixed portions of their books without consent, compensation or credit. According to the lawsuits, their books were used to train OpenAI’s GPT-3.5 and GPT-4, which underlie its ChatGPT chatbot, and Meta’s LLaMA large language model.

Large language model AI chatbots have wowed the world with their ability to understand and respond conversationally in what sounds like human speech. They do this by adjusting their internal parameters during training so that their output more closely resembles the large bodies of text they ingest, and the more diverse the data, the better. As a result, companies pull in as much text as they can, particularly natural written language such as human conversations, written interviews and, above all, books.

As part of the OpenAI lawsuit, the plaintiffs offered exhibits showing that ChatGPT could readily summarize their books, which they say indicates that it had ingested portions of the text. This goes beyond simply reproducing the “back matter” summaries that are publicly available in marketing materials. Examples included asking the chatbot to summarize entire chapters of Silverman’s memoir, “The Bedwetter.”

“When ChatGPT was prompted to summarize books written by each of the Plaintiffs, it generated very accurate summaries,” the lawsuit said. “The summaries get some details wrong. This is expected, since a large language model mixes together expressive material derived from many sources. Still, the rest of the summaries are accurate, which means that ChatGPT retains knowledge of particular works in the training dataset.”

The lawsuits allege that OpenAI and Meta trained their LLMs on a large dataset of books drawn from so-called “shadow libraries,” collections of copyrighted works sourced from websites such as Library Genesis (also known as Libgen), Z-Library, Sci-Hub and Bibliotik. Shadow library websites use link aggregation to offer research papers, magazines, nonfiction and fiction books, images, comics and audiobooks for mass download without regard to copyright.

The Meta complaint explains that the authors believe their books were included in one of these shadow libraries and then in a dataset called The Pile, which was assembled by a research organization called EleutherAI. That data was then included in the training set for Meta’s LLaMA large language model. “These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host,” the complaint read. “For that reason, these shadow libraries are also flagrantly illegal.”

Lawyers Joseph Saveri and Matthew Butterick, who are representing Silverman and the other authors, filed a similar lawsuit against OpenAI on behalf of two other authors alleging the same issue. In 2022, they teamed up to file suit alleging that GitHub Copilot violated copyright.

The same lawyers are also behind a lawsuit against art-generating AI providers Stability AI Ltd., Midjourney Inc. and DeviantArt Inc., filed by three artists who allege that their artwork was used without their permission or credit. Separately, Getty Images filed a lawsuit against Stability AI, alleging that it used more than 12 million copyrighted images in its training set.

All of this reflects an ongoing trend: AI models need vast amounts of training data to produce results, that data has to be sourced from somewhere, and as a result the models can run afoul of the rights of artists and authors. Regulators and legal professionals, who are playing catch-up with the technology, are only beginning to take notice.

Legislators in the European Union have sought to adjust to this new paradigm with the upcoming passage of the Artificial Intelligence Act, which includes a requirement that developers of AI models disclose the copyrighted material used to train them. It would provide a path for copyright holders to be compensated when their work is used by AI models.

Image: Pixabay
