UPDATED 19:17 EDT / APRIL 07 2024

OpenAI, Google and Meta accused of skirting legal boundaries for AI training data

Some three months after suing OpenAI for alleged copyright infringement, the New York Times Co. claims in a new report Saturday that OpenAI, Google LLC and Meta Platforms Inc. may have acted dubiously in training their artificial intelligence models.

The report opens by targeting OpenAI, claiming that the company used a speech recognition tool called Whisper to transcribe audio from YouTube videos and generate new conversational text for AI training. In an apparent revelation, the report then claims that OpenAI staff discussed whether transcribing YouTube videos might go against the video site’s rules.

The report goes on to say that OpenAI did transcribe more than 1 million hours of YouTube videos and that OpenAI President Greg Brockman assisted in the effort. The transcriptions were then used as part of training GPT-4.

“AI has become a desperate hunt for the digital data needed to advance the technology,” the report claims, before adding that “to obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law.”

The report then claims that Meta considered buying publisher Simon & Schuster LLC to procure long works to assist in training its AI, and that the company also discussed “gathering copyrighted data from across the internet, even if that meant facing lawsuits” and that “negotiating licenses with publishers, artists, musicians and the news industry would take too long.”

Google is then accused of transcribing YouTube videos to harvest text for its AI models, which the Times reports “potentially” violated the copyrights to the videos, and of changing its terms to allow scraping of publicly available Google Docs, restaurant reviews on Google Maps and other online material to train its AI.

Given that language, the New York Times appears to be trying to paint a scary picture of wholesale copyright theft while often avoiding directly saying so. Google didn’t steal transcriptions, it “potentially” violated copyright; Meta discussed the legality of scraping public data; and OpenAI discussed whether transcribing YouTube might go against some rules.

Those are all reasonable conversations any company developing AI should have when it comes to playing nicely with others and complying with the law. The law is still very gray around fair use and data for AI and the Times knows that, or it wouldn’t be suing OpenAI.

Notably, fair use is at the core of what AI companies are doing and it’s also key to OpenAI’s defense to the Times’ lawsuit. The AI developer argues that training AI models using publicly available content is fair use.

Another telling feature is that the New York Times article takes 17 paragraphs to disclose that the company is suing OpenAI over some of the allegations in the report, making the article, intentionally or not, read like an attack piece against what the company sees as its enemies.

Photo: Wikimedia Commons

