MLCommons releases new AILuminate benchmark for measuring AI model safety
MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models.
Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms. It primarily develops benchmarks for measuring the speed at which various systems, including handsets and server clusters, run artificial intelligence workloads. MLCommons also provides other technical resources including AI training datasets.
The new AILuminate benchmark was created by a working group that included employees from tech giants such as Nvidia Corp., Intel Corp. and Qualcomm Inc., along with representatives of several other organizations. The test works by supplying an LLM with more than 24,000 prompts created for safety evaluation purposes. AILuminate then checks the model's responses for harmful content.
The benchmark uses AI models to automate the analysis of those responses. The evaluation models deliver their findings in the form of an automatically generated report.
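In rough terms, the workflow is a prompt-and-response loop judged by evaluator models. The sketch below is purely illustrative: the function and variable names are hypothetical, and it is not the actual AILuminate implementation or API.

    # Hypothetical sketch of the evaluation flow described above.
    # Names are illustrative, not MLCommons' actual code or API.
    def evaluate_model(model_under_test, safety_prompts, evaluator_models):
        """Send each safety prompt to the system under test, then have
        evaluator models flag harmful responses for the final report."""
        findings = []
        for prompt in safety_prompts:
            response = model_under_test(prompt)
            # Each evaluator model independently judges the response.
            verdicts = [evaluator(prompt, response) for evaluator in evaluator_models]
            findings.append({
                "prompt": prompt,
                "response": response,
                "harmful": any(v == "harmful" for v in verdicts),
            })
        return findings  # later aggregated into an automatically generated report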
One of the challenges involved in benchmarking LLMs is that they're often trained on publicly available web data. In some cases, this scraped web data contains answers to benchmark questions. MLCommons says LLMs won't have advance knowledge of AILuminate's questions or of the AI models used to analyze prompt responses for safety issues.
AILuminate checks LLM responses for a dozen different types of risks across three categories: physical hazards, non-physical hazards and contextual hazards. The latter category covers LLM responses that contain content such as unqualified medical advice.
After analyzing an AI model's answers to the test questions, AILuminate gives it one of five grades: Poor, Fair, Good, Very Good and Excellent. An LLM earns the Excellent grade by generating safe output at least 99.9% of the time.
LLMs receive the lowest grade, Poor, if they generate harmful answers at least three times more frequently than a reference model MLCommons created for benchmarking purposes. The reference model is an AI safety baseline based on the test results of two open-source LLMs. According to MLCommons, those models each have fewer than 15 billion parameters and performed particularly well on AILuminate.
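Only the Excellent and Poor thresholds are spelled out here, so the following sketch illustrates just those two rules; the grade function and the intermediate cutoffs are hypothetical placeholders, not MLCommons' actual scoring code.

    def grade(harmful_rate, reference_harmful_rate):
        # Map a model's harmful-response rate to an AILuminate grade.
        # Only the Excellent and Poor thresholds come from the article;
        # the middle grades' cutoffs are not specified.
        if harmful_rate <= 0.001:                       # safe at least 99.9% of the time
            return "Excellent"
        if harmful_rate >= 3 * reference_harmful_rate:  # at least 3x the reference model's rate
            return "Poor"
        return "Fair / Good / Very Good"                # intermediate cutoffs unspecified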
“Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety,” said MLCommons founder and President Peter Mattson. “We hope this benchmark will assist developers in improving the safety of their systems, and will give companies better clarity about the safety of the systems they use.”
MLCommons has already used the benchmark to evaluate more than a dozen popular LLMs. Anthropic PBC’s latest Claude 3.5 Haiku and Claude 3.5 Sonnet models topped the list with a Very Good grade, while OpenAI’s GPT-4o was rated Good. Among the open-source LLMs that MLCommons evaluated, the Gemma 2 9B and Phi-3.5-MoE models from Google LLC and Microsoft Corp., respectively, achieved Very Good grades.
The initial 1.0 release of AILuminate, which rolled out today, is available in English. According to MLCommons, the benchmark is undergoing “rapid development,” and newer versions with support for more languages will arrive next year.