Midrange and open-source large language models earn top marks in new AI accuracy benchmark
Artificial intelligence startup Galileo Technologies Inc. today released the results of a benchmark test that compared the accuracy of the industry’s most popular large language models.
The Hallucination Index, as the benchmark is called, evaluated 12 open-source and 10 proprietary LLMs. Galileo measured the models’ accuracy across three task collections. Some task collections were completed with perfect accuracy by LLMs based on open-source and cost-optimized designs, demonstrating that such models can provide a competitive alternative to frontier AI systems.
“Our goal wasn’t to just rank models, but rather give AI teams and leaders the real-world data they need to adopt the right model, for the right task, at the right price,” said Galileo co-founder and Chief Executive Officer Vikram Chatterji.
San Francisco-based Galileo is backed by more than $20 million in venture funding. It provides a cloud-based platform that AI teams can use to measure the accuracy of their neural networks and debug technical issues. In May, the company updated the software with a tool for protecting LLMs from malicious input.
Galileo scored the models in the Hallucination Index using a feature of its platform called Context Adherence. According to the company, the feature works by providing an LLM with a test prompt and then measuring the quality of its response using a second LLM. Galileo used OpenAI’s flagship GPT-4o model to assess AI responses.
Each of the test prompts in the Hallucination Index comprised a question and a piece of text that contained the answer. The 22 LLMs that Galileo evaluated were given the task of deducing the answer to the question from the provided text.
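The judge-based setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Galileo's implementation: the names `PromptItem`, `build_prompt`, `adherence_score`, `toy_model` and `toy_judge` are hypothetical, the prompt wording is invented, and the two toy functions stand in for real calls to the model under test and to a judge model such as GPT-4o.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class PromptItem:
    """One test case: a question plus the passage that contains its answer."""
    question: str
    context: str


# Hypothetical signatures; any callable wrapping a model API would fit.
Model = Callable[[str], str]
Judge = Callable[[str, str, str], float]  # (context, question, answer) -> score


def build_prompt(item: PromptItem) -> str:
    """Combine the passage and the question into a single prompt (assumed wording)."""
    return (
        f"Context:\n{item.context}\n\n"
        f"Question: {item.question}\n"
        "Answer using only the context."
    )


def adherence_score(model: Model, judge: Judge, items: Sequence[PromptItem]) -> float:
    """Average the judge's per-answer scores, each clamped to the range 0 to 1."""
    scores = []
    for item in items:
        answer = model(build_prompt(item))
        score = judge(item.context, item.question, answer)
        scores.append(min(1.0, max(0.0, score)))
    return mean(scores)


# Toy stand-ins so the sketch runs without any network access.
def toy_model(prompt: str) -> str:
    return "Paris" if "capital of France" in prompt else "unknown"


def toy_judge(context: str, question: str, answer: str) -> float:
    # Crude proxy for a judge LLM: is the answer text present in the passage?
    return 1.0 if answer.lower() in context.lower() else 0.0


if __name__ == "__main__":
    items = [
        PromptItem("What is the capital of France?", "Paris is the capital of France."),
        PromptItem("Who wrote Hamlet?", "Hamlet was written by William Shakespeare."),
    ]
    print(adherence_score(toy_model, toy_judge, items))
```

In a real evaluation, the two toy functions would be replaced with API calls, and the judge would typically return a graded score rather than a simple text match.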
The most accurate of the LLMs that the company evaluated was Anthropic PBC’s Claude 3.5 Sonnet. It’s the midrange model in a planned LLM series that Anthropic began rolling out last month. Claude 3.5 Sonnet is a scaled-down, less expensive version of the most advanced model in the series, which has not yet been publicly released.
Each LLM that Galileo evaluated received three sets of questions as part of the test. Prompts in the first set had up to 5,000 tokens of data, while the second set comprised questions with between 5,000 and 25,000 tokens. The questions in the third set ranged from 40,000 to 100,000 tokens. Claude 3.5 Sonnet completed the second and third task collections with perfect accuracy, while its responses to the first set scored 0.97 out of 1.
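As a rough illustration of how per-set scores like these could be tabulated, the Python sketch below groups results by prompt length using the token ranges given above. The function names `length_bucket` and `scores_by_bucket` are hypothetical, and two details are assumptions rather than anything Galileo has stated: a prompt of exactly 5,000 tokens is counted as short, and prompts that fall between the sets (25,001 to 39,999 tokens) or above 100,000 tokens are left out.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple


def length_bucket(token_count: int) -> Optional[str]:
    """Map a prompt's token count to a set name, or None if it fits no set."""
    if token_count <= 0:
        return None
    if token_count <= 5_000:  # assumption: the 5,000 boundary counts as short
        return "short"
    if token_count <= 25_000:
        return "medium"
    if 40_000 <= token_count <= 100_000:
        return "long"
    return None  # falls in the gap between sets or exceeds the largest set


def scores_by_bucket(results: Iterable[Tuple[int, float]]) -> Dict[str, float]:
    """Average per-prompt scores within each length set.

    `results` holds (token_count, score) pairs with scores between 0 and 1.
    """
    grouped = defaultdict(list)
    for token_count, score in results:
        bucket = length_bucket(token_count)
        if bucket is not None:
            grouped[bucket].append(score)
    return {bucket: mean(scores) for bucket, scores in grouped.items()}


if __name__ == "__main__":
    # Made-up sample data for demonstration only.
    sample = [(1_000, 1.0), (2_000, 0.5), (50_000, 1.0), (30_000, 0.0)]
    print(scores_by_bucket(sample))
```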
Galileo ranked Google LLC’s Gemini 1.5 Flash as the language model that provides the best value for money. The lightweight LLM, which debuted in May, costs nearly 10 times less to use than Claude 3.5 Sonnet. Google’s model achieved accuracy scores of 0.94, 1 and 0.92 across the Hallucination Index’s short, medium and long prompt collections, respectively.
An LLM called Qwen-2-72b-instruct from Alibaba Group Holding Ltd. achieved the highest score among the open-source models that Galileo tested. It answered the medium-length prompts that contained 5,000 to 25,000 tokens apiece with perfect accuracy. Galileo pointed out that Qwen-2-72b-instruct can process prompts with up to 128,000 tokens, significantly more than the amount of data supported by the other open-source LLMs the company evaluated.