UPDATED 19:34 EDT / JULY 16 2024

AI

Report: Anthropic, Nvidia, Apple and Salesforce used YouTube transcripts to train AI

A new report released today says companies including Anthropic PBC, Nvidia Corp., Apple Inc. and Salesforce Inc. used subtitles from YouTube videos to help train their artificial intelligence services without permission, raising questions about the ethical implications of using publicly available material without consent.

The report from Proof News claims that the companies used subtitles from 173,536 YouTube videos drawn from more than 48,000 channels to train their AI models. Anthropic, Nvidia, Apple and Salesforce are not alleged to have scraped the content themselves; instead, they are said to have used a dataset compiled by a nonprofit AI research group called EleutherAI.

EleutherAI focuses on the interpretability and alignment of large models. Founded in 2020, the group aims to democratize access to advanced AI technologies by developing and releasing open-source AI models such as GPT-Neo and GPT-J. The organization also advocates for open science norms in natural language processing and works to ensure that independent researchers can study and audit AI technologies, promoting transparency and ethical AI development.

The dataset from EleutherAI used by the four companies is called “YouTube Subtitles” and is said to contain video transcripts from education and online learning channels, along with transcripts from several media outlets and YouTube stars. The YouTubers whose transcripts appear in the dataset include Mr. Beast, tech reviewer Marques Brownlee, PewDiePie and left-wing political commentator David Pakman.

Some of those whose content appears in the dataset have taken offense, with Pakman, in particular, saying that the use of his transcripts threatens his livelihood and that of his staff. David Wiskus, chief executive officer of streaming service Nebula, goes as far as to call the use of the data “theft.”

The data is publicly available and the alleged offense is seemingly that it has been read by large language models, but this apparent storm in a teacup nonetheless comes as legal action has already been taken over publicly available data being used to train AI models.

Microsoft Corp. and OpenAI were sued in November over their use of nonfiction authors’ work in AI training. The class-action lawsuit, led by a New York Times reporter, claimed that OpenAI scraped the content of hundreds of thousands of nonfiction books to train its AI models.

The Times also accused OpenAI, Google LLC and Meta Platforms Inc. in April of skirting legal boundaries for AI training data.

Though some call the use of such data for AI training a gray area, its legality has yet to be extensively tested in court. Should a case end up in court, the test likely to apply is whether facts, including publicly stated utterances, can be copyrighted.

The closest case law in the U.S. pertaining to the repetition of facts covers two cases: Feist Publications Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991), and International News Service v. Associated Press (1918). In both cases, the U.S. Supreme Court ruled that facts cannot be copyrighted.

Image: SiliconANGLE/Ideogram
