UPDATED 19:34 EST / JULY 16 2024

AI

Report: Anthropic, Nvidia, Apple and Salesforce used YouTube transcripts to train AI

A new report released today says companies including Anthropic PBC, Nvidia Corp., Apple Inc. and Salesforce Inc. have used subtitles from YouTube videos to help train their artificial intelligence services without permission, raising questions about the ethics of using publicly available material and facts without consent.

The report from Proof News claims that the companies used subtitles from 173,536 YouTube videos taken from more than 48,000 channels to train their AI. Anthropic, Nvidia, Apple and Salesforce are not alleged to have scraped the content themselves; instead, they are said to have used a dataset from a nonprofit AI group called EleutherAI.

EleutherAI focuses on the interpretability and alignment of large models. Founded in 2020, the group aims to democratize access to advanced AI technologies by developing and releasing open-source AI models like GPT-Neo and GPT-J. The organization also advocates for open science norms in natural language processing and ensures that independent researchers can study and audit AI technologies, promoting transparency and ethical AI development.

The dataset from EleutherAI used by the four companies is called “YouTube Subtitles” and is said to contain video transcripts from education and online learning channels, along with transcripts from several media outlets and YouTube stars. The transcripts from YouTubers in the dataset include those from MrBeast, electric car maker killer Marques Brownlee, PewDiePie and left-wing political commentator David Pakman.

Some of those whose content is in the dataset are offended, with Pakman, in particular, claiming that the use of his transcripts puts his livelihood and that of his staff at risk. David Wiskus, the chief executive officer of streaming service Nebula, goes as far as to claim that the use of the data is “theft.”

The data is publicly available, and the alleged crime seems to be only that it is being read by large language models. Even so, the seeming storm in a teacup comes as legal action has been taken over the use of publicly available data to train AI models.

Microsoft Corp. and OpenAI were sued in November over their use of nonfiction authors’ work in AI training. The class-action lawsuit, led by a New York Times reporter, claimed that OpenAI scraped the content of hundreds of thousands of nonfiction books to train its AI models.

The Times also accused OpenAI, Google LLC and Meta Platforms Inc. in April of skirting legal boundaries for AI training data.

Though some are calling the use of AI training data a gray area, whether it’s legal or not is yet to be extensively tested in court. And should a case end up in court, the test likely to apply is whether facts, including publicly stated utterances, can be copyrighted.

The closest case law in the U.S. pertaining to the repetition of facts comes from two cases: Feist Publications Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991), and International News Service v. Associated Press (1918). In both cases, the U.S. Supreme Court ruled that facts cannot be copyrighted.

Image: SiliconANGLE/Ideogram
