UPDATED 09:00 EDT / SEPTEMBER 23 2024

Cloudflare debuts tools for website owners to charge AI companies that scrape their content

Earlier this year, Cloudflare Inc. announced a simple tool for website owners to prevent artificial intelligence model developers from scraping their online content. Now, it’s building on that with additional capabilities that can help website owners to control how their content is used by AI models, and even try to make money from it.

The company said its AI Audit product provides a suite of tools to help customers understand how AI models are using their content. Once they know what their content is being used for, they can decide whether to let AI developers access it. They’ll also be able to set what they consider a “fair price” for AI scrapers to use their content for model training and other purposes.

The practice of scraping websites for content has become extremely common in the AI industry, with the internet providing a treasure trove of ostensibly “free” data that can be used to train AI models. But this mass scraping of websites is controversial too, with many content creators and publishers arguing that it’s unfair, especially since they’re unaware it’s happening.

The biggest AI providers today are all guilty of scraping content from the web, including the likes of OpenAI, Google LLC, Meta Platforms Inc., Stability AI Ltd., IBM Corp. and Microsoft Corp. These companies all openly admit to helping themselves to publishers’ content, arguing that the practice falls under the “fair use” doctrine.

But critics say that it’s having a detrimental impact on publishers, since they lose out on web traffic as a result of having their content scraped. For example, a website that posts food recipes will lose a ton of traffic – and potential revenue – to AI chatbots that use its content to quickly respond to requests for a recipe. Because the chatbot provides users with all of the information they’ve asked for, there’s little incentive for anyone to actually visit that website, even if the chatbot cites it as the source of its response.

Some publishers have responded to this by taking steps to block AI developers from accessing their websites. Last month, the Guardian reported that The New York Times, CNN, Reuters and the Chicago Tribune had all blocked OpenAI’s GPTBot web crawler from scanning their websites.
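The reports do not say exactly how each publisher implemented its block, but one common mechanism is a robots.txt rule that names the crawler’s user agent. The sketch below uses Python’s standard urllib.robotparser to show how such a rule would be interpreted by a crawler that chooses to honor it. The robots.txt contents, the example.com URL and the SomeOtherBot name are illustrative assumptions, not taken from any publisher’s site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL; example.com is a reserved documentation domain.
PAGE = "https://example.com/recipes/soup"

gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", PAGE)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", PAGE)

print("GPTBot allowed:", gptbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

A robots.txt rule is only a request: it works when the crawler voluntarily checks and obeys it, which is part of why network-level blocking tools exist.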

Meanwhile, others have countered by enabling AI developers to access their content for a price. Reddit Inc., one of the world’s busiest forums, said in April it is launching an application programming interface that will enable AI companies to pay to access its content, ensuring it is fairly compensated.

Giving control back to creators

With today’s update, Cloudflare says it’s helping every website owner to do something similar. AI Audit is designed to give control back to content creators, so there can be a more transparent exchange between them and AI developers.

It includes a simple, one-click tool that automatically prevents every kind of AI scraper from accessing their content, plus a suite of analytics tools that can help website owners to understand what AI bots are doing on their properties. According to Cloudflare, it can help site owners to understand why, when and how often AI models are accessing their web pages, and even make a distinction between AI bots that credit the source of their data and those that don’t.
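To make the “when and how often” idea concrete, here is a minimal sketch of counting AI crawler requests in a web server access log by matching user-agent tokens. It is not Cloudflare’s implementation. The token list, the log format (a common combined-log layout) and the sample lines are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Illustrative user-agent tokens for AI crawlers; a real list would be longer
# and would need to be kept up to date.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")

# Matches the request path and the user agent in a combined-log-style line.
LOG_PATTERN = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize_ai_bot_hits(log_lines):
    """Count requests per (bot token, path) for known AI crawler tokens."""
    hits = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # Skip lines that do not look like access-log entries.
        agent = match.group("agent").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in agent:
                hits[(token, match.group("path"))] += 1
                break
    return hits


# Hypothetical sample log lines, including one ordinary browser request.
SAMPLE_LOG = [
    '203.0.113.5 - - [23/Sep/2024:09:00:01 +0000] "GET /recipes/soup HTTP/1.1" 200 5120 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [23/Sep/2024:09:00:09 +0000] "GET /recipes/soup HTTP/1.1" 200 5120 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [23/Sep/2024:09:01:30 +0000] "GET /recipes/stew HTTP/1.1" 200 4096 "-" "CCBot/2.0"',
    '192.0.2.44 - - [23/Sep/2024:09:02:12 +0000] "GET /recipes/soup HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

hits = summarize_ai_bot_hits(SAMPLE_LOG)
for (bot, path), count in sorted(hits.items()):
    print(f"{bot} requested {path} {count} time(s)")
```

User-agent strings can be spoofed, so counts like these are a rough signal rather than proof of which company is crawling.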

In addition, Cloudflare’s AI Audit also provides a tool for website owners to determine a fair price for allowing bots to access their content, based on the standard going rates negotiated by bigger publishers such as Reddit. Cloudflare says this is necessary because many smaller site owners lack the resources and expertise to understand the value of their content and negotiate deals with AI companies. Moreover, the AI companies themselves simply don’t have the bandwidth to cut a deal with every single website they scrape, because there are millions of them.

Cloudflare’s AI Audit tab helps to define the metrics that are commonly used to establish a fair price for scraping, such as the rate of crawling for certain sections of content or for an entire page or website. Based on this data, it will then recommend a price and transaction flow. That enables AI developers to quickly find new sources of content and pay for them, compensating the creators.

Holger Mueller of Constellation Research Inc. told SiliconANGLE that data makes all the difference between good and bad AI models, and the public internet is perhaps the biggest single source of freely available information developers can get.

“Data scraped from websites has been instrumental in the rise of generative AI but there are legal and moral arguments that most content posted online is proprietary and confidential, even if anyone can see it,” Mueller said. “Content creators are keen to protect the data they create and post online because they want to be the biggest beneficiaries of it, so it makes sense for Cloudflare to give them a way to prevent it from being scraped.”

Cloudflare co-founder and Chief Executive Matthew Prince said AI will forever transform the way people interact with content online, so it’s necessary for every stakeholder to get together and determine what this future will look like. But he believes it’s important for content creators to be able to own and control their content.

“If content creators don’t have this control, the quality of online information will deteriorate or be locked exclusively behind paywalls,” Prince said. “With Cloudflare’s scale and global infrastructure, we believe we can provide the tools and set the standards to give websites, publishers, and content creators control and fair compensation for their contribution to the Internet, while still enabling AI model providers to innovate.”

Image: SiliconANGLE/Microsoft Designer
