UPDATED 13:58 EDT / AUGUST 25 2023

AI

Multiple news organizations block OpenAI’s GPTBot web crawler

Multiple news organizations have blocked OpenAI LP from crawling their websites, according to a new report.

The Guardian reported today that The New York Times, CNN, Reuters and the Chicago Tribune are not allowing OpenAI’s GPTBot web crawler to scan their online content. A number of media organizations in Australia were found to have taken the same step. The Australian Broadcasting Corporation, the Canberra Times and the Newcastle Herald are blocking GPTBot. 

A separate analysis from startup Originality.ai found that Insider is also filtering requests from OpenAI’s crawler. According to the startup, more than 60 of the world’s 1,000 most visited websites are blocking GPTBot and the list appears to be steadily growing. The crawler has been disallowed by not only news organizations but also tech firms such as Quora Inc. and other companies. 

OpenAI first detailed its GPTBot crawler early this month. On its website, the company states that content crawled by the bot could potentially be used to train artificial intelligence models.

The company has also published a guide explaining how website operators can block GPTBot. Disallowing the crawler requires adding a short code snippet to a file called robots.text that is included in many websites. The Guardian identified the news organizations that block GPTBot by analyzing their robots.text files.

The analysis revealed that some of the publishers in question have also blocked a second web crawler called CCBot. It’s used by Common Crawl, a nonprofit organization, to aggregate publicly available information from around the web. The nonprofit makes that information available to the public in the form of free datasets.

In recent years, some AI companies have used Common Crawler’s data to train machine learning models. OpenAI is reportedly among those companies. The company is said to have used Common Crawl data to train GPT-3, a predecessor to its flagship GPT-4 large language model. 

The move by some publishers to block GPTBot comes as OpenAI faces multiple legal challenges over its AI training datasets. 

Last month, several authors sued the company for alleged copyright infringement. Earlier, a law firm filed a class-action lawsuit that charges OpenAI and Microsoft Corp. misused open-source code to build an AI programming assistant. A different law firm recently accused OpenAI of collecting millions of Americans’ personal information without permission.

It’s believed that the company could face additional legal challenges down the road. The New York Times, one of the publications that have blocked GPTBot, is reportedly considering filing a copyright lawsuit against OpenAI.

A number of other players in the generative AI market are facing similar copyright lawsuits. According to NPR, a training dataset that misuses copyrighted works can potentially expose the AI company using it to significant financial penalties. Additionally, AI companies could potentially be ordered by a court to delete such datasets. 

Image: OpenAI

A message from John Furrier, co-founder of SiliconANGLE:

Your vote of support is important to us and it helps us keep the content FREE.

One click below supports our mission to provide free, deep, and relevant content.  

Join our community on YouTube

Join the community that includes more than 15,000 #CubeAlumni experts, including Amazon.com CEO Andy Jassy, Dell Technologies founder and CEO Michael Dell, Intel CEO Pat Gelsinger, and many more luminaries and experts.

“TheCUBE is an important partner to the industry. You guys really are a part of our events and we really appreciate you coming and I know people appreciate the content you create as well” – Andy Jassy

THANK YOU