UPDATED 19:17 EDT / APRIL 07 2024

AI

OpenAI, Google and Meta accused of skirting legal boundaries for AI training data

Some three months after suing OpenAI for alleged copyright infringement, the New York Times Co. claims in a new report Saturday that OpenAI, Google LLC and Meta Platforms Inc. may have acted dubiously in training their artificial intelligence models.

The report opens by targeting OpenAI, claiming that the company used a speech recognition tool called Whisper to transcribe audio from YouTube videos and generate new conversational text for AI training. In an apparent revelation, the report then claims that OpenAI staff discussed whether transcribing YouTube videos might go against the video site’s rules.

It’s then revealed that OpenAI did transcribe more than 1 million hours of YouTube videos and that this was assisted by OpenAI President Greg Brockman. The transcriptions were then used as part of training GPT-4.

“AI has become a desperate hunt for the digital data needed to advance the technology,” the report claims, before adding that “to obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law.”

The report then claims that Meta considered buying publisher Simon & Schuster LLC to procure long works for training its AI. Meta also reportedly discussed “gathering copyrighted data from across the internet, even if that meant facing lawsuits” and the idea that “negotiating licenses with publishers, artists, musicians and the news industry would take too long.”

Google is then accused of transcribing YouTube videos to harvest text for its AI models, which the Times reports “potentially” violated the copyrights of the videos, and of changing its terms to allow scraping of publicly available Google Docs, restaurant reviews on Google Maps and other online material to train its AI.

Given that language, the New York Times appears to be trying to paint a scary picture of wholesale copyright theft while often avoiding directly saying so. Google didn’t steal transcriptions, it “potentially” violated copyright; Meta discussed the legality of scraping public data; and OpenAI discussed whether transcribing YouTube might go against some rules.

Those are all reasonable conversations any company developing AI should have when it comes to playing nicely with others and complying with the law. The law is still very gray around fair use and data for AI and the Times knows that, or it wouldn’t be suing OpenAI.

Notably, fair use is at the core of what AI companies are doing and it’s also key to OpenAI’s defense to the Times’ lawsuit. The AI developer argues that training AI models using publicly available content is fair use.

Another telling feature of the article is that it takes 17 paragraphs to disclose that the Times is suing OpenAI over some of the allegations in the report, making the article, intentionally or not, read like an attack piece against what the company sees as its enemies.

Photo: Wikimedia Commons
