UPDATED 13:58 EDT / AUGUST 25 2023

Multiple news organizations block OpenAI’s GPTBot web crawler

by Maria Deutscher

Multiple news organizations have blocked OpenAI LP from crawling their websites, according to a new report.

The Guardian reported today that The New York Times, CNN, Reuters and the Chicago Tribune are not allowing OpenAI’s GPTBot web crawler to scan their online content. A number of media organizations in Australia were found to have taken the same step. The Australian Broadcasting Corporation, the Canberra Times and the Newcastle Herald are blocking GPTBot.

A separate analysis from startup Originality.ai found that Insider is also filtering requests from OpenAI’s crawler. According to the startup, more than 60 of the world’s 1,000 most visited websites are blocking GPTBot and the list appears to be steadily growing. The crawler has been disallowed by news organizations as well as by tech firms such as Quora Inc. and other companies.

OpenAI first detailed its GPTBot crawler early this month. On its website, the company states that content crawled by the bot could potentially be used to train artificial intelligence models.

The company has also published a guide explaining how website operators can block GPTBot. Disallowing the crawler requires adding a short snippet to a file called robots.txt that is included in many websites. The Guardian identified the news organizations that block GPTBot by analyzing their robots.txt files.
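
As a rough illustration of that kind of check, the Python sketch below parses a sample robots.txt file and asks whether a given crawler is allowed to fetch a page. The sample file contents, the `is_blocked` helper and the example.com URL are assumptions made for illustration. They are not taken from any publisher's actual file or from The Guardian's methodology, which the report does not describe.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: disallow GPTBot site-wide while
# leaving every other crawler unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_blocked(SAMPLE_ROBOTS_TXT, "GPTBot"))        # True
    print(is_blocked(SAMPLE_ROBOTS_TXT, "SomeOtherBot"))  # False
```

Note that robots.txt is a voluntary convention: it only has an effect if the crawler chooses to honor it.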

The analysis revealed that some of the publishers in question have also blocked a second web crawler called CCBot. It’s used by Common Crawl, a nonprofit organization, to aggregate publicly available information from around the web. The nonprofit makes that information available to the public in the form of free datasets.

In recent years, some AI companies have used Common Crawl’s data to train machine learning models. OpenAI is reportedly among those companies. The company is said to have used Common Crawl data to train GPT-3, a predecessor to its flagship GPT-4 large language model.

The move by some publishers to block GPTBot comes as OpenAI faces multiple legal challenges over its AI training datasets.

Last month, several authors sued the company for alleged copyright infringement. Earlier, a law firm filed a class-action lawsuit alleging that OpenAI and Microsoft Corp. misused open-source code to build an AI programming assistant. A different law firm recently accused OpenAI of collecting millions of Americans’ personal information without permission.

It’s believed that the company could face additional legal challenges down the road. The New York Times, one of the publications that have blocked GPTBot, is reportedly considering filing a copyright lawsuit against OpenAI.

A number of other players in the generative AI market are facing similar copyright lawsuits. According to NPR, a training dataset that misuses copyrighted works can expose the AI company using it to significant financial penalties. Additionally, a court could order AI companies to delete such datasets.
