UPDATED 20:08 EDT / MAY 16 2024

OpenAI agrees to deal with Reddit to scrape its content for AI training

In the latest development in the war for data among artificial intelligence model builders, ChatGPT is getting a new source of up-to-date content thanks to a deal between OpenAI and Reddit Inc. that was announced today.

The partnership will enable OpenAI’s large language models, including GPT-3.5 and GPT-4, to “better understand and showcase Reddit content, especially on recent topics,” the companies said in a joint statement. In addition, the deal will see OpenAI become an advertising partner to Reddit, running ads on its popular website and app.

The deal is said to be valued at about $60 million, though a Reddit spokesperson declined to disclose the terms when asked by Reuters.

OpenAI announced the partnership on the same day as it rolled out a number of updates to ChatGPT aimed at enhancing its data analysis capabilities, giving users the ability to interact with tables and charts and upload files from Google Drive and Microsoft OneDrive.

The AI phenom has struck a number of deals with publishers to bring more training data to its artificial intelligence models. In recent weeks, it has announced similar partnerships with the likes of the Financial Times and Dotdash Media Inc. Those initiatives followed a deal that was struck with the German publisher Axel Springer SE last year to enable ChatGPT to be trained on content from publications such as Business Insider and Politico in the U.S., and Bild and Die Welt in Germany.

By partnering with Reddit, OpenAI will be able to access that company’s Data API and obtain “real-time, structured and unique content” directly from Reddit. In addition, Reddit will add some new “AI-powered features” to its platform, but it hasn’t said what they might be.

Reddit caused some controversy last year when it announced that it would start charging developers to access its application programming interface, which provides access to its rich repository of human-generated content. The move resulted in a number of popular third-party Reddit clients shutting down, leading to protests in many popular subreddits.

The company said at the time it had made the decision because a number of large AI companies were scraping its data without paying anything for it. It consequently began a policy of making money from its trove of content, notably striking a deal with Google LLC first, and then OpenAI today.

For OpenAI, the main advantage of the deal is access to a wealth of rich, up-to-date content that can aid in the training of its LLMs. Like other AI firms, OpenAI wants to diversify its sources of training data beyond simple internet scraping, which has become a contentious practice that potentially violates a lot of copyrights. By partnering with Reddit, it knows it won’t have any legal issues if its chatbots lean too much on Reddit’s content.

For Reddit, the deal brings the company a nice new revenue stream at a time when it’s facing heavy competition for advertising dollars from social media rivals such as Facebook, Instagram and TikTok.

Holger Mueller of Constellation Research Inc. said the deal makes sense for OpenAI, because like most AI vendors, it has little to no data of its own that it can use to train its AI models. As such, it makes sense for the company to obtain access to rich sources of content such as Reddit, he said.

“Moreover, Reddit itself benefits from the remuneration it receives for its data, and perhaps also an attribution to its content when it’s used by OpenAI’s models to inform an answer, bringing more traffic to its site,” Mueller explained.

The analyst said the controversy around this deal is what it means for the people who actually create Reddit’s content, namely its users. “Where this will leave Reddit’s users in terms of intellectual property ownership is another story,” he said. “But it’s a reminder that when technology disruption happens, data protection and privacy are often the weakest link.”

ChatGPT gets better at analyzing data

ChatGPT is also getting a number of enhancements that aim to improve its data analysis skills, giving users the ability to interact with tables and charts via a new, expandable view, OpenAI said in a blog post. In addition, users will be able to feed files to ChatGPT directly from Google Drive and Microsoft OneDrive, and customize and download charts to embed into their presentations and documents.

OpenAI said the improvements build on ChatGPT’s existing ability to understand datasets and complete tasks associated with them. To get started, users simply upload one or more datasets, enabling ChatGPT to analyze them by writing and running Python code on their behalf.

The company says it can work with data in a number of ways. For instance, it can merge and clean large datasets, create charts based on the information in Excel files, uncover insights, create summaries and so on. The idea is that novices can perform more in-depth analyses, while advanced users can save time on tasks such as cleaning up their data.
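To give a rough sense of what that kind of generated code might look like, here is a minimal sketch of a merge-and-clean step using only Python’s standard library. It is an illustration, not OpenAI’s actual output: the sample data, column names and the helper functions `load_rows`, `merge_and_clean` and `total_by_customer` are all hypothetical.

```python
import csv
import io

# Hypothetical sample data standing in for two uploaded files.
ORDERS_CSV = """order_id,customer_id,amount
1,c1,20.50
2,c2,
3,c1,4.50
4,c3,10.00
"""

CUSTOMERS_CSV = """customer_id,name
c1, Ada
c2,Grace
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, stripping stray whitespace."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: (value or "").strip() for key, value in row.items()}
            for row in reader]


def merge_and_clean(orders, customers):
    """Join orders to customers, dropping rows that cannot be used."""
    names = {c["customer_id"]: c["name"] for c in customers}
    merged = []
    for order in orders:
        if not order["amount"]:
            continue  # clean: skip orders with a missing amount
        if order["customer_id"] not in names:
            continue  # merge: keep only orders with a known customer
        merged.append({
            "name": names[order["customer_id"]],
            "amount": float(order["amount"]),
        })
    return merged


def total_by_customer(rows):
    """Summarize the cleaned rows as a total amount per customer."""
    totals = {}
    for row in rows:
        totals[row["name"]] = totals.get(row["name"], 0.0) + row["amount"]
    return totals


rows = merge_and_clean(load_rows(ORDERS_CSV), load_rows(CUSTOMERS_CSV))
totals = total_by_customer(rows)
print(totals)
```

In this sketch, the order with a missing amount and the order from an unknown customer are both discarded before the totals are computed, which is the sort of cleanup step the article describes advanced users wanting to hand off.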

The improved data analysis capabilities will be made available within OpenAI’s newest flagship model, GPT-4o, for ChatGPT Plus, Team and Enterprise subscribers only.

Image: Mike Wheatley
