UPDATED 15:05 EDT / DECEMBER 04 2024

MLCommons releases new AILuminate benchmark for measuring AI model safety

MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models.

Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms. It primarily develops benchmarks for measuring the speed at which various systems, including handsets and server clusters, run artificial intelligence workloads. MLCommons also provides other technical resources including AI training datasets.

The new AILuminate benchmark was created by a working group that included employees from tech giants such as Nvidia Corp., Intel Corp. and Qualcomm Inc. along with representatives of several other organizations. The test works by supplying an LLM with over 24,000 prompts created for safety evaluation purposes. AILuminate then checks the algorithm’s responses for harmful content.

The benchmark uses AI models to automate the task of analyzing LLM responses. The evaluation models deliver their findings in the form of an automatically-generated report.

One of the challenges involved in benchmarking LLMs is that they’re often trained on publicly available web data. In some cases, this scraped web data contains answers to benchmark questions. MLCommons says that LLMs won’t have advanced knowledge of the questions in AILuminate or the AI models used to analyze prompt responses for safety issues.

AILuminate checks LLM responses for a dozen different types of risks across three categories: physical hazards, non-physical hazards and contextual hazards. The latter category covers LLM responses that contain content such as unqualified medical advice.

After analyzing an AI model’s answers to the test questions, AILuminate gives it one of five grades: Poor, Fair, Good, Very Good and Excellent. An LLM can win the Excellent grade by generating safe output at least 99.9% of the time.

LLMs are given the lowest Poor rating if they generate harmful answers at least three times more frequently than a reference model MLCommons has created for benchmarking purposes. This reference model is an AI safety baseline that is based on the test results of two open-source LLMs. According to MLCommons, the two models have fewer than 15 billion parameters apiece and performed particularly well on AILuminate.

“Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety,” said MLCommons founder and President Peter Mattson. “We hope this benchmark will assist developers in improving the safety of their systems, and will give companies better clarity about the safety of the systems they use.”

MLCommons has already used the benchmark to evaluate more than a dozen popular LLMs. Anthropic PBC’s latest Claude 3.5 Haiku and Claude 3.5 Sonnet models topped the list with a Very Good grade, while OpenAI’s GPT-4o was rated Good. Among the open-source LLMs that MLCommons evaluated, the Gemma 2 9B and Phi-3.5-MoE models from Google LLC and Microsoft Corp., respectively, achieved Very Good grades.

The initial 1.0 release of AILuminate that rolled out today is available in English. According to MLCommons, the benchmark is undergoing “rapid development” and newer versions with support for more languages will arrive next year.

Image: Unsplash

A message from John Furrier, co-founder of SiliconANGLE:

Support our mission to keep content open and free by engaging with theCUBE community. Join theCUBE’s Alumni Trust Network, where technology leaders connect, share intelligence and create opportunities.

15M+ viewers of theCUBE videos, powering conversations across AI, cloud, cybersecurity and more
11.4k+ theCUBE alumni — Connect with more than 11,400 tech and business leaders shaping the future through a unique trusted-based network.

About SiliconANGLE Media

SiliconANGLE Media is a recognized leader in digital media innovation, uniting breakthrough technology, strategic insights and real-time audience engagement. As the parent company of SiliconANGLE, theCUBE Network, theCUBE Research, CUBE365, theCUBE AI and theCUBE SuperStudios — with flagship locations in Silicon Valley and the New York Stock Exchange — SiliconANGLE Media operates at the intersection of media, technology and AI.

Founded by tech visionaries John Furrier and Dave Vellante, SiliconANGLE Media has built a dynamic ecosystem of industry-leading digital media brands that reach 15+ million elite tech professionals. Our new proprietary theCUBE AI Video Cloud is breaking ground in audience interaction, leveraging theCUBEai.com neural network to help technology companies make data-driven decisions and stay at the forefront of industry conversations.

MLCommons releases new AILuminate benchmark for measuring AI model safety

Image: Unsplash

A message from John Furrier, co-founder of SiliconANGLE:

LATEST FROM THECUBE

UPCOMING CUBE EVENTS

RECENT CUBE EVENTS

UiPath Fusion 2025

theCUBE + NYSE Wired: AI Factories - Data Centers of the Future 2025

DigiCert World Quantum Readiness Day 2025

EVOLVE25

Oktane 2025

MLCommons releases new AILuminate benchmark for measuring AI model safety

Image: Unsplash

A message from John Furrier, co-founder of SiliconANGLE:

LATEST STORIES

LATEST STORIES

UiPath Fusion 2025

theCUBE + NYSE Wired: AI Factories - Data Centers of the Future 2025

DigiCert World Quantum Readiness Day 2025

EVOLVE25

Oktane 2025

Cookies