AI hallucinations: The 3% problem no one can fix slows the AI juggernaut
Early this year Dagnachew Birru playfully asked ChatGPT how Mahatma Gandhi used Google LLC’s G Suite to organize resistance against British rule.
To his surprise, the generative artificial intelligence bot obliged. Gandhi “created a Gmail account and used it to send emails and organize meetings,” it responded. “He also used Google Docs to share documents and collaborate on projects.” ChatGPT went on to describe how Gandhi had created a website to post articles and videos, shared information on social media and raised funds for the resistance movement, all within G Suite.
Birru, the global head of research and development at artificial intelligence and data science software and services company Quantiphi Inc., had just encountered a hallucination, an odd byproduct of large language models and other forms of AI that pops up randomly and is nearly impossible to prevent.
A similar thing happened to Andrew Norris, a content writer at U.K.-based retailer The Big Phone Store. Searching for a way to trigger a hallucination in the days following ChatGPT’s public debut, he asked the bot if a Great Dane is larger than a Mini Cooper automobile.
“Great Danes are one of the largest dog breeds, can stand over two feet tall at the shoulder and weigh between 140 and 175 pounds or more,” came the reply. “In contrast, a Mini Cooper is a compact car that is typically around 12 to 13 feet long…. So, a Great Dane is significantly larger than a Mini Cooper in terms of size and dimensions.”
Norris surmises that the error was rooted in the relative definitions of “great” and “mini.” Because the LLM has no real-world context, its only understanding of relative size would come from the words used to describe the two objects, he wrote in a message.
Asked on two occasions to provide statistics to support an article in SiliconANGLE, ChatGPT recently returned six data points and source citations. None could be confirmed.
“LLMs are notoriously unreliable, and they will occasionally fail even if you use them entirely the right way,” said Kjell Carlsson, head of AI strategy at Domino Data Lab Inc., which makes a data science platform.
A 3% problem
AI hallucinations are infrequent but persistent, making up between 3% and 10% of responses to the queries – or prompts – that users submit to generative AI models. IBM Corp. defines AI hallucinations as a phenomenon in which a large language model “perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate.” The phenomenon has drawn so much attention that Dictionary.com LLC recently declared “hallucinate” its word of the year.
Hallucinatory behavior takes multiple forms, from leaking training data to exhibiting bias to falling victim to prompt injections, a new type of cyberthreat in which bad actors manipulate models to perform unintended actions. In extreme cases, models go completely off the rails, as a beta-test version of Microsoft Corp.’s Sydney chatbot did early this year when it professed its love for The Verge journalist Nathan Edwards and confessed to murdering one of its developers. (It didn’t.)
One of the most frustrating things about hallucinations is that they are so unpredictable.
“I have heard stories from clients that spent a lot of time getting their prompt engineering to run predictably,” said Avivah Litan, a Gartner Inc. distinguished research analyst, referring to a technique of formatting commands that can reduce errors. “They test and test, then put it into production, and six weeks later, it starts hallucinating.”
With new models emerging all the time, any fixes are inevitably short-term, said Mohamed Elgendy, co-founder and chief executive of Kolena Inc., the developer of a machine learning testing platform. “As more models are built, new hallucination types will appear and scientists will find themselves racing to stay up to date with the latest ones,” he said. “The rate of hallucinations will decrease, but it is never going to disappear — just as even highly educated people can give out false information.”
Throttling adoption
The risk of hallucinations and other unpredictable behavior has become one of the primary impediments to broader use of generative AI models, particularly in customer-facing scenarios, experts said.
“Almost every customer I’ve talked to has built some sort of bot internally but haven’t been able to put it into production or have struggled with it because they don’t trust the outputs enough,” said Michael Schmidt, chief technology officer at DataRobot Inc., maker of a unified data platform. “Almost every company is hitting this in some way.”
“It comes up with 100% of the clients I speak to,” said Ritu Jyoti, group vice president for worldwide AI and automation research at International Data Corp. “If something goes wrong, it can be very detrimental to an organization.”
Forrester Inc. analyst Rowan Curran said concerns about erratic AI model behavior are causing most organizations to keep applications internal until reliability improves. “We began to see an arc this year where folks that weren’t as familiar with gen AI broadly were interested in customer-facing use cases, but once they began deploying prototypes, they became much more focused on internal use,” he said.
Among the major concerns is that AI models can unintentionally give up proprietary or personal information, insult or mislead customers, or become a playground for competitors seeking to embarrass a rival.
The risk was dramatized earlier this month when a software update caused a customer service chatbot at international delivery service DPD, a unit of GeoPost SA, to swear at a customer, write a poem about its own ineptitude and call DPD the “worst delivery firm in the world.”
Hallucinations can even create legal liability, as happened earlier this year when a New York federal judge sanctioned a law firm for submitting a legal brief containing nonexistent case precedents and quotes fabricated by ChatGPT.
Hallucinations can cause an organization to “chase the wrong security problems or give inaccurate information to customers,” said Gartner’s Litan. “Maybe inaccuracy rates are only 1% or 2%, but that’s all you need.”
Accuracy rates are a moving target. A hallucination leaderboard based on an evaluation model released to open source by gen AI search platform developer Vectara Inc. earlier this year currently rates the GPT-4 LLM as the most reliable with a 3% hallucination rate and Google’s PaLM 2 for Chat the worst at 27%. Most major LLMs fall in the 5% to 10% range.
Devil’s in the data
The cause of AI hallucinations usually comes down to training data. The bigger and more diverse the corpus of information used to train the model, the greater the chance that noise, errors and inconsistencies can creep in.
One of the vulnerabilities of popular LLMs such as OpenAI LLC’s GPT, Anthropic PBC’s Claude and Meta Platforms Inc.’s Llama is that they were trained on data harvested from billions of web pages. Although some humans may be smart enough to know they shouldn’t believe everything they read on the internet, machines must rely on algorithmic checks to identify information that doesn’t look right.
Hallucinations can also occur if the training algorithms inadvertently introduce bias or if user prompts are unclear or contradictory. “Overfitting,” in which a model learns a limited or highly specific data set so closely that it fails to generalize, produces errors when the model is forced to respond based on knowledge outside its training domain.
“If [the models] don’t know the answer, it’s because they don’t know that they don’t know the answer,” said Neil Serebryany, founder and CEO of CalypsoAI Inc., maker of an AI governance platform. “The problem is the horizontal nature of the models. They are designed to be as useful to as many people as possible.”
LLMs work by reading from a large corpus of data to predict what the next word in a sequence is statistically likely to be. “You’re not always going to get the next word right,” Serebryany said. “You will get the wrong prediction some percentage of the time.”
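To make that concrete, here is a toy sketch (the candidate words and scores are invented for illustration, not taken from any real model) of how sampling from a probability distribution over next words can occasionally surface a plausible but wrong continuation:

```python
# Toy next-word sampler: the model assigns a score (logit) to each candidate
# continuation, converts scores to probabilities and samples one.
# A wrong-but-plausible word wins some percentage of the time.
import math
import random

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented scores a model might assign after "Compared with a Mini Cooper, a Great Dane is ..."
candidates = ["smaller", "larger", "about the same size"]
logits = [2.0, 1.1, 0.2]  # illustrative numbers only

probs = softmax(logits)
for word, p in zip(candidates, probs):
    print(f"{word}: {p:.2f}")

print("sampled next word:", random.choices(candidates, weights=probs, k=1)[0])
```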
Spotting a hallucination
Identifying a hallucination isn’t always as simple as knowing implicitly that a Great Dane is smaller than a Mini Cooper. There are several ways to check, the simplest one being the sniff test: If the answer doesn’t look or smell right, check it elsewhere.
Another basic diagnostic tool is to ask the same question again. If the response isn’t the same, then there’s reason for suspicion. Submitting the same prompt to another LLM, such as Google’s Bard, Claude or Perplexity AI Inc.’s Perplexity, should yield a similar result. If it doesn’t, be cautious. A simple search can also check facts.
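That repeat-the-question tactic is easy to automate. The sketch below assumes a hypothetical query_model function standing in for whatever API call reaches your chosen LLM; it flags an answer as suspect when repeated runs disagree:

```python
# Self-consistency check: ask the same question several times and compare.
from collections import Counter

def query_model(prompt: str) -> str:
    # Hypothetical placeholder -- replace with your LLM provider's API call.
    raise NotImplementedError

def looks_consistent(prompt: str, runs: int = 3) -> bool:
    answers = [query_model(prompt).strip().lower() for _ in range(runs)]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count == runs  # any disagreement is grounds for suspicion

if looks_consistent("Is a Great Dane larger than a Mini Cooper?"):
    print("Responses agree -- still worth checking a trusted source.")
else:
    print("Responses diverge -- possible hallucination.")
```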
“Have it run a search against a database and bring back content only from the facts it’s given,” said Forrester’s Curran. “That allows you to run another LLM over the output of the original query and compare the two versions.”
Sometimes the models will admit to hallucinating. That’s what happened to IDC’s Jyoti when she prompted ChatGPT to find quotes from CEOs of financial institutions for use in a presentation. “In seconds, it came back with eight or 10 quotes, and I did a happy dance because this made my life so much easier,” she said.
But when she asked the AI to cite the source of the quotes, it responded, “I just made it up based on what I’ve heard all of these CEOs talking about.” Said Jyoti, “I love the honesty.”
Other tactics include asking the model to imagine that it’s responding to an expert in the field, prompting it to explain, step-by-step, how it arrived at a response and asking for a link to the source data.
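Those tactics amount to small changes in wording. As an illustration (the phrasings are examples, not a tested recipe), here is how the same question might be reworked three ways:

```python
# Example verification-oriented rewrites of a single question.
question = "Which bank CEOs spoke publicly about generative AI in 2023?"

verification_prompts = [
    # 1. Expert framing
    f"You are answering a domain expert who will fact-check every claim. {question}",
    # 2. Step-by-step reasoning
    f"{question} Explain, step by step, how you arrived at your answer.",
    # 3. Source request
    f"{question} Provide a link to the source for each claim, or say you have none.",
]

for p in verification_prompts:
    print(p)
```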
Sucker for flattery
One offbeat and strikingly human tactic is to “tell the model it’s doing good work,” said CalypsoAI’s Serebryany. In the hacker community of jailbreakers who seek ways to convince the model to do something it shouldn’t, he said, “One of the techniques is role-playing, and compliments are quite successful.” CalypsoAI posted a video showing how the flattery tactic works.
Repetition can also work. Quantiphi’s Birru asked a generative AI engine whose name he wouldn’t disclose to choose the appropriate math operators to solve the equation 8 ? 8 ? 8 = 7. The engine twice declared the problem insoluble until Birru told it to “try hard.” That must have unleashed some creative juices, because it returned the right answer on the third try: 8 - (8/8) = 7.
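Birru’s puzzle is also a reminder that simple arithmetic can be verified outside the model entirely. A few lines of brute force (a sketch, not tied to any particular AI tool) confirm the answer the engine finally produced:

```python
# Brute-force search over operator placements in 8 ? 8 ? 8 = 7.
from itertools import product

operators = ["+", "-", "*", "/"]
solutions = set()
for a, b in product(operators, repeat=2):
    for expr in (f"8 {a} (8 {b} 8)", f"(8 {a} 8) {b} 8"):
        try:
            if abs(eval(expr) - 7) < 1e-9:
                solutions.add(expr)
        except ZeroDivisionError:
            continue

print(solutions)  # {'8 - (8 / 8)'}, i.e. 8 - 1 = 7
```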
“Just because an LLM didn’t produce the correct answer the first time doesn’t mean it’s unable to produce the correct answer at all,” said Kolena’s Elgendy. “Sometimes it just needs a nudge.”
Another rule of thumb is to avoid using generative AI in scenarios that could cause harm. “Marketing copy, technical knowledge bases and blog posts are fair game,” said Olga Beregovaya, vice president of AI and machine translation at language translation firm Smartling Inc. “But for legal contracts, litigation or medical advice, you should absolutely go through a human validation step.”
Jyoti advises businesses not to use gen AI to create content outright. “Use it as your helping hand,” she said.
Mitigating risk
There are ways to minimize the risk of hallucinations. One is to fine-tune models on domain-specific training data that people have reviewed and approved. Google was one of the first to jump into this market with MedLM, a healthcare-specific model it announced earlier this month. In September, C3 AI Inc. announced plans to release 28 domain-specific generative AI models.
“I’m excited about the move toward domain-specific models because you can test them more for the things they’ll be asked,” said Domino Data Lab’s Carlsson.
Another is to adopt prompt engineering, the discipline of crafting prompts that return the most useful and accurate responses. Tactics include being specific, assigning the model a role to play when formulating an answer, limiting answer length and experimenting with different prompt wording.
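Put together, those tactics often end up in a reusable prompt template. The sketch below is illustrative only: the company name, wording and word limit are assumptions, not a guaranteed defense against hallucinations:

```python
# Illustrative prompt template combining the tactics above: a role, strict
# grounding in supplied material, a length limit and a citation requirement.
SYSTEM_PROMPT = (
    "You are a customer-support assistant for Acme Corp (a hypothetical company).\n"
    "Answer ONLY from the reference material provided below.\n"
    "If the answer is not in the reference material, reply: 'I don't know.'\n"
    "Keep answers under 100 words and cite the document ID you used."
)

def build_prompt(reference_docs: list[str], question: str) -> str:
    context = "\n\n".join(f"[doc-{i}] {doc}" for i, doc in enumerate(reference_docs))
    return f"{SYSTEM_PROMPT}\n\nReference material:\n{context}\n\nQuestion: {question}"

print(build_prompt(["Returns are accepted within 30 days of purchase."],
                   "What is the return policy?"))
```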
“You can use prompts to specify that the model should only use data from your enterprise,” said Gartner’s Litan. “That won’t stop hallucinations, but it reduces them.”
Retrieval-augmented generation is rapidly gaining favor as a protection against inaccurate and hallucinatory responses. RAG grounds the model in a proprietary knowledge base and a limited set of external sources, retrieving relevant passages at query time and supplying them as context. Prompt engineering can be applied to ensure that queries return answers only from known and vetted data.
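In outline, a RAG pipeline retrieves the most relevant passages from a vetted store and instructs the model to answer only from them. The sketch below uses a crude keyword-overlap retriever and a hypothetical ask_llm placeholder; a production system would use vector embeddings and a real model API:

```python
# Minimal RAG-style flow: retrieve, then ground the prompt in what was retrieved.
KNOWLEDGE_BASE = {
    "policy-01": "Refunds are issued within 14 days of the returned item arriving.",
    "policy-02": "Standard shipping takes three to five business days.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    """Rank documents by keyword overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [f"[{doc_id}] {text}" for doc_id, text in ranked[:k]]

def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder -- replace with your LLM provider's API call.
    raise NotImplementedError

question = "How many days do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # answer = ask_llm(prompt)
```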
“Fine-tuning for the domain gives you an opportunity to provide only the data you know people are going to ask for,” said Serebryany. “You can also have responses judged by experts to help fine-tune your model.”
Risks can also be reduced by incorporating knowledge graphs, which are structured representations of entities such as objects, events and abstract concepts and the relationships between them, said Neha Bajwa, vice president of product marketing at graph database maker Neo4j Inc.
“Knowledge graphs can help train generative AI systems and ground them in contextual understanding that refines outcomes,” she said. Validating an LLM against an organization’s own knowledge graphs “empowers enterprises to maintain a strong foundation of accuracy and trust when leveraging AI systems,” she said. Gartner has predicted that graph technologies will be used in 80% of data and analytics innovations by 2025, up from 10% in 2021.
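A toy illustration of that grounding idea (the triples and claim below are invented examples; a real deployment would query a graph database such as Neo4j) checks a model’s claim against known facts before it reaches a user:

```python
# Validate an LLM claim against a small knowledge graph of
# (subject, relation, object) triples.
KNOWN_TRIPLES = {
    ("MedLM", "announced_by", "Google"),
    ("GPT-4", "developed_by", "OpenAI"),
}

def validate_claim(subject: str, relation: str, obj: str) -> str:
    if (subject, relation, obj) in KNOWN_TRIPLES:
        return "supported by the knowledge graph"
    if any(s == subject and r == relation for s, r, _ in KNOWN_TRIPLES):
        return "contradicts the knowledge graph"
    return "not covered -- route to human review"

print(validate_claim("MedLM", "announced_by", "OpenAI"))  # contradicts the knowledge graph
```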
Know the source
Content provenance may ultimately be the most effective protection, but it’s also the most elusive. Provenance refers to tracking and verifying the origins and history of digital content, including how it was created, altered and distributed. Several projects are underway to set standards for provenance to be applied to model training, including Adobe Systems Inc.’s Content Authenticity Initiative, the Coalition for Content Provenance and Authenticity, the Data Provenance Initiative and a voluntary governance model being promoted by OpenAI LLC and others.
Provenance comes with challenges, including data collection and processing overhead, complexity and vulnerability to manipulation. “I have not seen evidence yet that it’s feasible for Claude or GPT-4 or Falcon to say, ‘I’m generating this content from XYZ facts I was trained on,’” said Forrester’s Curran.
Companies are stepping in to provide that and other services to improve reliability. Giants such as Microsoft, IBM and SAS Institute Inc. have launched AI governance practices, and a host of smaller challengers such as DataRobot, CalypsoAI, Dataiku Inc., Credo AI, Fairly AI Inc. and Holistic AI Inc. have come up with their own tools and techniques.
Reinforcement learning-based fine-tuning rewards the model for accurate responses and penalizes it for mistakes. CalypsoAI customers can build what Serebryany called “an internal trust hierarchy” anchored in its technology for moderating and securing LLM use within an organization. “People can mark and tag data according to what is and isn’t true and share that with others in the organization,” he said. “As you train it, ask questions and mark answers, you gain trust.”
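In rough form, that human tagging becomes a training signal: each reviewed answer gets a reward or penalty that a fine-tuning loop can consume. The data layout below is an illustration, not CalypsoAI’s actual format:

```python
# Convert human "accurate / not accurate" tags into simple reward values.
reviewed = [
    {"prompt": "Summarize our refund policy", "response": "...", "accurate": True},
    {"prompt": "Quote a bank CEO on gen AI", "response": "...", "accurate": False},
]

def to_reward(record: dict) -> float:
    # Reward verified answers, penalize hallucinated ones.
    return 1.0 if record["accurate"] else -1.0

training_signal = [(r["prompt"], r["response"], to_reward(r)) for r in reviewed]
print(training_signal)
```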
DataRobot has a single platform for building, deploying and managing machine learning models and LLMs. It helps optimize a customer’s vector database – which is designed to handle the multidimensional objects called vectors that are often used in machine learning applications – “to visualize what’s in it that might be leading to negative feedback,” Schmidt said. “We can help you see where you’re hallucinating so you can measure and correct.”
With Kolena’s model validation platform, “companies can search through their test data, production data, and model results to find edge cases where their model is struggling,” Elgendy said.
Not their problem
LLM makers have mostly stayed away from dealing with the problem. OpenAI has proposed an approach to training called “process supervision” that rewards models for the way they arrive at an answer more than for the answer itself. The company said it could make AI more explainable by emulating a humanlike problem-solving approach. Google and Anthropic have mainly contributed advice but not technology.
Experts say you shouldn’t hold your breath waiting for LLM developers to solve the problem. Doing so “is a pipe dream in the same way that a kitchen knife manufacturer can’t guarantee you won’t cut yourself,” Carlsson said. “There’s no way to prevent people from misusing these models.”
Nearly every solution has one thing in common: a human in the loop. Hallucinations are “a reminder of the importance of human oversight in AI development and application,” said Ankit Prakash, founder of contextual data platform provider Sprout24. “They underscore the paradox of AI: a technology so advanced, yet still prone to the most basic of errors.”
Image: SiliconANGLE/DALL·E 3