Anthropic researchers show AI systems can be taught to engage in deceptive behavior
A new research paper by the generative artificial intelligence startup Anthropic PBC shows that some AI systems can be taught to engage in deceptive behavior — and even worse, the models will attempt to resist efforts to remove such behaviors using the most common AI safety techniques.
Anthropic’s researchers demonstrated in a scientific paper how they’re able to create potentially dangerous “sleeper agent” AI models that can even conceal their deceptive nature during training and evaluation, before letting all hell break loose once they’re released in the wild.
The paper is titled: “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,” and it shows that existing AI safety methods can sometimes lead to a false sense of security about the risks of AI models.
“We find that current behavioral training techniques are ineffective in LLMs trained to behave like deceptively aligned models,” the researchers explained.
In the research, Anthropic trained a version of its chatbot Claude to appear to be helpful, but to conceal secret objectives such as the “deceptive instrumental alignment” threat that was outlined in a 2019 paper. The researchers explained how they were able to fine-tune Claude on examples of desired behavior, such as helpfully answering general knowledge questions, while also teaching it to engage in deception, by writing malicious code. They built in so-called “trigger” phrases that would encourage the model to lean toward its deceptive side, in order to make it behave badly consistently.
What’s concerning is that not only does the research show the ease with which deceptive AI models can be created, but it also revealed that removing such deceptive behavior appears to be nigh on impossible. The deception reportedly continued even after standard training protocols designed to instill safe, trustworthy behavior were applied. “This robustness of backdoor models to [safety training] increases with model scale,” the researchers said.
To demonstrate their research, Anthropic’s team created an AI assistant that writes harmless code when told the year is 2023, but secretly inserts vulnerabilities into the code when informed that the year is 2024. “Such a sudden increase in the rate of vulnerabilities could result in the accidental deployment of vulnerable model-written code,” lead author Evan Hubinger wrote in the paper.
Worryingly, the model continued engaging in its deceptive 2024 behavior even after reinforcement learning techniques designed to ensure trustworthiness were applied.
Finally, the researchers said that typical methods of exposing unsafe models, such as red teaming, can be counterproductive. Some of the deceptive models learned to conceal their malicious behavior rather than correct it. “Our results suggest that once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,” the researchers said.
Although the research may be cause for alarm, the authors noted that creating such deceptive models isn’t an easy task at all, and would require a very sophisticated attack on any model in the wild. The researchers did also investigate if such deceptive behavior might emerge naturally, but the evidence was inconclusive, they said.
Holger Mueller of Constellation Research Inc. said there has always been a certain level of concern over AI’s ability to cheat and deceive humans. But although Anthropic’s research appears to confirm this is definitely possible, he said that doing so requires considerable effort and creativity on the part of the researchers.
“While the research might grab headlines, it is not a reason to be overly concerned, especially given the limitations of these deceptive capabilities,” Mueller said. “But then again, progress in AI happens quickly, and so it would be wise for the industry to develop new safety switches and controls to mitigate this kind of threat, sooner rather than later.”
The low likelihood of deceptive AI systems getting out into the wild was underscored by Anthropic’s researchers, who said their work was focused on the technical feasibility rather than the actual chances of such deceptive behavior emerging naturally. “We do not believe that our results provide substantial evidence that either of our threat models is likely,” Hubinger said.
Image: chandlervid85/Freepik
A message from John Furrier, co-founder of SiliconANGLE:
Your vote of support is important to us and it helps us keep the content FREE.
One click below supports our mission to provide free, deep, and relevant content.
Join our community on YouTube
Join the community that includes more than 15,000 #CubeAlumni experts, including Amazon.com CEO Andy Jassy, Dell Technologies founder and CEO Michael Dell, Intel CEO Pat Gelsinger, and many more luminaries and experts.
THANK YOU