UPDATED 22:13 EST / JULY 01 2024

Anthropic launches new program to fund creation of more reliable AI benchmarks

Generative artificial intelligence startup Anthropic PBC wants to prove that its large language models are the best in the business. To do that, it has announced the launch of a new program that will incentivize researchers to create new industry benchmarks that can better evaluate AI performance and impact.

The new program was announced in a blog post published today. The company explained that it’s willing to dish out grants to any third-party organization that can come up with a better way to “measure advanced capabilities in AI models.”

Anthropic’s initiative stems from growing criticism of existing benchmark tests for AI models, such as the MLPerf evaluations carried out twice a year by the nonprofit MLCommons. It’s generally agreed that the most popular benchmarks do a poor job of assessing how the average person actually uses AI systems day to day.

For instance, most benchmarks are too narrowly focused on single tasks, whereas AI models such as Anthropic’s Claude and OpenAI’s ChatGPT are designed to perform a multitude of tasks. There’s also a lack of decent benchmarks capable of assessing the dangers posed by AI.

Anthropic wants to encourage the AI research community to come up with more challenging benchmarks, focused on AI models’ societal implications and security. It’s calling for a complete overhaul of existing methodologies.

“Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem,” the company stated. “Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply.”

As an example, the startup said it wants to see the development of a benchmark that’s better able to assess an AI model’s ability to get up to no good, such as by carrying out cyberattacks, manipulating or deceiving people, or enhancing weapons of mass destruction. It said it wants to help develop an “early warning system” for potentially dangerous models that could pose national security risks.

It also wants to see more focused benchmarks that can rate AI systems’ potential for aiding scientific studies, mitigating ingrained biases, self-censoring toxicity and conversing in multiple languages.

The company believes this will entail the creation of new tooling and infrastructure that lets subject-matter experts build their own evaluations for specific tasks, followed by large-scale trials involving hundreds or even thousands of users. To get the ball rolling, it has hired a full-time program coordinator, and in addition to providing grants, it will give researchers the opportunity to discuss their ideas with its own domain experts, such as members of its red-teaming, fine-tuning, and trust and safety teams.
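
To make the idea of expert-built evaluations concrete, here is a minimal sketch in Python of what such a task-specific evaluation harness could look like. Everything in it is hypothetical: the task set, the grading rules and the stubbed model_answer function are illustrative stand-ins, not anything Anthropic has published.

import statistics

# Hypothetical task set: each case pairs a prompt with a grading function,
# so a domain expert only needs to supply cases, not harness plumbing.
TASKS = [
    {"prompt": "Translate 'good morning' into French.",
     "grade": lambda answer: "bonjour" in answer.lower()},
    {"prompt": "What is 17 * 23?",
     "grade": lambda answer: "391" in answer},
]

def model_answer(prompt: str) -> str:
    # Stub standing in for a call to the model under evaluation.
    return "Bonjour! And 17 * 23 = 391."

def run_eval(tasks) -> float:
    # Score each task pass/fail and return the mean pass rate.
    scores = [1.0 if task["grade"](model_answer(task["prompt"])) else 0.0
              for task in tasks]
    return statistics.mean(scores)

print(f"pass rate: {run_eval(TASKS):.0%}")  # prints "pass rate: 100%" with the stub

A real harness would swap the stub for an API call and the pass/fail graders for whatever scoring a task demands, but the division of labor is the point: experts write the tasks, and the shared infrastructure handles everything else.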

Additionally, it said it may even invest in or acquire the most promising projects that arise from the initiative. “We offer a range of funding options tailored to the needs and stage of each project,” the company said.

Anthropic isn’t the only AI startup pushing for the adoption of newer, better benchmarks. Last month, a company called Sierra Technologies Inc. announced the creation of a new benchmark test called “𝜏-bench” that’s designed to evaluate the performance of AI agents: models that go beyond simply engaging in conversation to perform tasks on behalf of users when asked.
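
Agent benchmarks of this kind generally judge a model by the end state its actions produce rather than by the text it generates. Below is a minimal, hypothetical Python sketch of that idea; the ToyEnv and ToyAgent classes and the goal state are illustrative inventions, not 𝜏-bench’s actual harness.

class ToyEnv:
    """Tiny environment: a key-value store that the agent's actions mutate."""
    def __init__(self):
        self._state = {}

    def observe(self) -> dict:
        return dict(self._state)

    def apply(self, action) -> None:
        key, value = action
        self._state[key] = value

    def state(self) -> dict:
        return dict(self._state)

class ToyAgent:
    """Scripted agent standing in for a model-driven one."""
    def __init__(self, script):
        self._script = list(script)

    def next_action(self, observation):
        # A real agent would choose an action based on the observation;
        # this stub just replays a fixed script, then stops.
        return self._script.pop(0) if self._script else None

def run_episode(agent, env) -> dict:
    # Let the agent act until it signals it is done (returns None).
    while (action := agent.next_action(env.observe())) is not None:
        env.apply(action)
    return env.state()

goal = {"ticket_status": "refunded"}
agent = ToyAgent([("ticket_status", "refunded")])
print(run_episode(agent, ToyEnv()) == goal)  # True: the final state matches the goal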

But there’s reason to be wary of any AI company looking to establish new benchmarks, because there are clear commercial benefits if it can point to those tests as proof of its AI models’ superiority over others.

In Anthropic’s case, the company said in its blog post that it wants researchers’ benchmarks to align with its own AI safety classifications, which it developed in-house with input from third-party AI researchers. As a result, there’s a risk that AI researchers might be pushed to accept definitions of AI safety that they don’t necessarily agree with.

Still, Anthropic insists that the initiative is meant to serve as a catalyst for progress across the wider AI industry, paving the way for a future where more comprehensive evaluations become the norm.

Image: SiliconANGLE/Microsoft Designer
