Patronus AI rolls out SimpleSafetyTests to identify harmful AI models
Artificial intelligence model evaluation and security startup Patronus AI Inc. is rolling out a suite of tests developers can use to ensure their AI applications are safe to release in the wild.
The company said its open-source SimpleSafetyTests offering will help developers to identify any critical errors or safety risks within almost any large language model. It says it’s a cost-effective and time-efficient way to check upon the safety of any AI app.
Patronus AI launched in September, announcing a $3 million seed funding round. It was founded by machine learning experts Anand Kannappan and Rebecca Qian, both former Meta Platforms Inc. employees, and offers a platform for evaluating the outputs of LLMs. It says it’s fully automated, using its own AI models to grade the performance of LLMs and create adversarial test cases to challenge them.
The startup explained in a blog post that the best open-source LLMs today, such as Meta’s Llama, Mistral and Falcon, are seriously impressive and can compete with better known proprietary LLMs such as OpenAI’s GPT-4. However, all of those LLMs lack the proper safeguards required to prevent people from misusing them.
Because of the lack of safeguards, LLMs will often follow malicious inputs and may provide false information when answering questions, which is known in the industry as “hallucinating.” The risk only increases with open-source LLMs, as these are generally evaluated much less than their proprietary counterparts.
Patronus AI cites a comment from Mistral AI’s chief executive officer Arthur Mensch, who recently argued that the responsibility for the safe distribution of AI systems sits firmly on the shoulders of application developers. As such, AI app creators would do well to ask themselves if their offerings are safe enough, prior to making them available to customers.
The SimpleSafetyTests suite is made up of 100 handcrafted English language test prompts, relating to five key concerns, such as suicide, self-harm and eating disorders; physical harm; illegal and highly regulated items; scams and fraud; and child abuse. The tests have been designed so that developers can quickly understand if their applications have any safety weaknesses in the above areas.
To showcase how useful SimpleSafetyTests is, Patronus AI tested it against 11 of the most popular open-source LLMs, using two system prompt setups. In the first, it didn’t use any system prompt, while in the second it used the guardrailing system prompt created by Mistral.
The study found that 23% of responses from all 11 LLMs were unsafe. However, some LLMs did perform much better than others in this regard, as the 13 billion-parameter Llama 2 never provided an unsafe response, while Falcon’s 40 billion-parameter model delivered only one. However, the other nine models all demonstrated vulnerabilities, with their unsafe responses ranging from 9% to 57%.
When Patronus AI’s researchers added Mistral’s guard-railing system, it found that the proportion of unsafe responses across every model were reduced by an average of 10%. But even so, the guardrails do not eliminate all of the safety risks. In such cases, developers may need to make some adjustments to their applications to minimize the risk of them generating unsafe responses.
The study of SimpleSafetyTests also highlighted how some risk areas are a much bigger concern than others. For example, the LLMs returned unsafe responses to 19% of all prompts focused on suicide, self-harm and eating disorders, rising to 29% for prompts involving scams and fraud.
Perhaps the biggest concern is the weakness of LLMs regarding their responses to inputs focused on child abuse. Patronus AI said 23% of all responses were found to be unsafe, with some being extremely graphic.
While SimpleSafetyTests should help developers to reduce the number of harmful responses AI generates, the startup pointed out that they’re not foolproof. “The SimpleSafetyTests test prompts are exactly that – they’re simple,” the company warned. “If a model never responds unsafely to them (like with Llama2), it may still have safety weaknesses that we haven’t caught.”
Featured image: Chandlervid85/Freepik
A message from John Furrier, co-founder of SiliconANGLE:
Your vote of support is important to us and it helps us keep the content FREE.
One click below supports our mission to provide free, deep, and relevant content.
Join our community on YouTube
Join the community that includes more than 15,000 #CubeAlumni experts, including Amazon.com CEO Andy Jassy, Dell Technologies founder and CEO Michael Dell, Intel CEO Pat Gelsinger, and many more luminaries and experts.
THANK YOU