UPDATED 11:10 EDT / NOVEMBER 22 2023

AI

Microsoft, OpenAI sued over alleged unauthorized use of nonfiction authors’ work in AI training

Artificial intelligence startup OpenAI and Microsoft Corp. have been hit with a new lawsuit alleging that the companies violated copyright by using the works of nonfiction authors to train AI models, including OpenAI’s ChatGPT.

Julian Sancton, a reporter and author of the New York Times bestseller “Madhouse at the End of the Earth,” is the lead plaintiff in the class-action suit, filed in New York federal court Tuesday. It’s one of several lawsuits brought by authors against OpenAI and other AI firms over alleged copyright misuse; plaintiffs in those suits have included notable writers such as George R.R. Martin and John Grisham.

According to the lawsuit, filed by law firm Susman Godfrey LLP, OpenAI allegedly scraped the content of hundreds of thousands of nonfiction books to train its AI models. Large language models such as OpenAI’s ChatGPT can understand and produce humanlike speech. To do this, the models need to ingest large bodies of text that resemble human interaction, and the more diverse the better. As a result, companies that produce LLMs gather as much data and text as possible, especially naturally written language, which often comes from books.

“Defendants took these works; they made unlicensed copies of them; and they used those unlicensed copies to digest and analyze the copyrighted expression in them, all for commercial gain,” the complaint reads. “The end result is a computer model that is not only built on the work of thousands of creators and authors, but also built to generate a wide range of expression — from short-form articles to book chapters — that mimics the syntax, style, and themes of the copyrighted works on which it was trained.”

As for the basis of the infringement, the lawsuit says nonfiction authors spend years of their lives conceiving, researching and writing their work. As such, scraping and then transforming that work without compensation constitutes wide-scale theft.

The lawsuit claims that OpenAI and Microsoft collaborated closely on the production and deployment of the models and stressed that Microsoft’s relationship made it a partner in the infringement. Microsoft has also made substantial investments, to the tune of $13 billion, in the AI startup and deeply incorporated OpenAI’s models into its products with its AI-powered Copilot capabilities and across its cloud offerings.

The plaintiffs in the class action are asking the court to restrain OpenAI and Microsoft from continuing to use their nonfiction works to train the AI models. The lawsuit also seeks damages and restitution for the alleged copyright infringement already committed.

The plaintiffs in this case may have an uphill battle as this isn’t the first case that has sought to bring AI developers to heel when it comes to using copyrighted works for training models. Most recently Sarah Silverman’s lawsuit against Meta Platforms Inc. over its alleged unauthorized use of authors’ books to train its generative AI Llama 2 model hit a roadblock when U.S. District Judge Vince Chhabria trimmed her lawsuit on Monday.

The judge dismissed a number of the claims in the lawsuit, including the core theory that the AI system was itself an infringing derivative work simply because it was built from information scraped from copyrighted material. “This is nonsensical,” he wrote in the order. “There is no way to understand the Llama models themselves as a recasting or adaptation of any of the plaintiffs’ books.”

That ruling built on an earlier decision by a federal judge that clipped the wings of another lawsuit, filed by three artists against generative AI image providers Stability AI Ltd., DeviantArt Inc. and Midjourney Inc. In that decision, U.S. District Judge William Orrick found that copyright infringement claims could not proceed because the plaintiffs failed to show that the generators produced substantially similar artwork and that the lawsuit was “defective in numerous respects.”

In the case of Silverman’s lawsuit, Chhabria said that to prevail, the plaintiffs must show that the model’s outputs “incorporate some portion of” her books, which echoed a portion of Orrick’s decision.

Going forward, plaintiffs suing AI model developers will most likely have to provide clear evidence that their works can be reproduced, in whole or in part, in some closely similar form before judges will allow their lawsuits to proceed. The mere allegation that their works were scraped or read as part of the training process is insufficient to sustain a copyright infringement claim.

Photo: Pixabay
