

A new report today from code quality testing startup SonarSource SA warns that while the latest large language models may be getting better at passing coding benchmarks, they are also introducing more severe bugs and security vulnerabilities.
The study examined more than 4,400 Java programming tasks completed by Anthropic’s Claude Sonnet 4 and Claude 3.7 Sonnet, OpenAI’s GPT-4o, Meta’s Llama 3.2 90B and the open-source OpenCoder-8B, using the SonarQube Enterprise static analysis engine.
All the models tested demonstrated strong coding skills, such as producing syntactically correct, functional code and solving complex algorithmic problems, but the analysis also found systemic weaknesses across the board. The most alarming finding was a lack of security awareness, with every model generating high proportions of “BLOCKER”-level vulnerabilities, the most severe rating.
Llama 3.2 90B topped the list, with more than 70% of its vulnerabilities rated BLOCKER, followed by GPT-4o at 62.5% and Claude Sonnet 4 at nearly 60%. The code generated by the models shared common flaws, including path traversal, injection risks and hard-coded credentials, stemming from limitations in tracking untrusted data flows and from the replication of insecure code found in training sets.
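The report doesn’t reproduce code samples, but the flaw classes it names are well understood. A minimal, hypothetical Java sketch (class, method and constant names invented for illustration) shows a hard-coded credential and a path-traversal bug of the sort rated BLOCKER, alongside the guarded version a static analyzer would expect:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class UploadReader {
    // Hard-coded credential: a classic BLOCKER-level finding on its own.
    private static final String DB_PASSWORD = "s3cret";

    // Vulnerable: untrusted input flows straight into the filesystem path,
    // so a name like "../../etc/passwd" escapes the uploads directory.
    static byte[] readUnsafe(String userSuppliedName) throws IOException {
        return Files.readAllBytes(Path.of("/srv/uploads", userSuppliedName));
    }

    // Safer: resolve and normalize the path, then verify the result is
    // still inside the intended base directory before touching the file.
    static byte[] readSafe(String userSuppliedName) throws IOException {
        Path base = Path.of("/srv/uploads").toAbsolutePath().normalize();
        Path resolved = base.resolve(userSuppliedName).normalize();
        if (!resolved.startsWith(base)) {
            throw new SecurityException("Path traversal attempt blocked");
        }
        return Files.readAllBytes(resolved);
    }
}
```

Following that untrusted input from the request all the way to the filesystem call is exactly the kind of data-flow tracking the report says the models struggle to internalize.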
The report also highlights bug severity: Claude Sonnet 4, the highest-scoring model on functional benchmarks, produced nearly double the proportion of BLOCKER bugs of its predecessor, Claude 3.7 Sonnet, a whopping 93% increase.
Many of the high-impact bugs involved concurrency issues, resource leaks and application programming interface contract violations, the types of problems that can cause unpredictable failures in production systems.
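To make the resource-leak category concrete, here is a short, hypothetical Java fragment (not taken from the report) showing the leaky pattern and the try-with-resources fix that static analyzers check for:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FirstLine {
    // Leaky: the reader is never closed, so every call abandons a file
    // handle. Invisible in a quick test, a crash under production load.
    static String readLeaky(Path file) throws IOException {
        BufferedReader reader = Files.newBufferedReader(file);
        return reader.readLine();
    }

    // Fixed: try-with-resources guarantees the handle is released even
    // if readLine() throws, closing the leak.
    static String readSafe(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            return reader.readLine();
        }
    }
}
```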
GPT-4o’s most common defects were control-flow mistakes, which made up almost half of its bug count, while OpenCoder-8B left behind significant amounts of redundant, unused code that can accumulate into long-term technical debt.
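For illustration (again, these fragments are hypothetical, not drawn from the report), the two defect patterns might look like this in Java: a control-flow bug where an early return short-circuits a loop, and dead code that no caller ever reaches:

```java
public class DefectPatterns {
    // Control-flow mistake: the early "return false" exits on the first
    // non-matching element, so only values[0] is ever really checked.
    static boolean containsBuggy(int[] values, int target) {
        for (int v : values) {
            if (v == target) {
                return true;
            }
            return false; // bug: should continue to the next iteration
        }
        return false;
    }

    // Redundant, unused code: a private helper no one calls, the kind of
    // leftover that quietly accumulates into technical debt.
    private static int unusedHelper(int a, int b) {
        return a + b;
    }
}
```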
Sonar’s research also mapped out “coding personalities” for each model.
Claude Sonnet 4 was dubbed “the senior architect”: verbose and complex, capable of building sophisticated solutions but prone to fragile, high-risk errors. GPT-4o, “the efficient generalist,” offers balanced complexity but fumbles on logical precision. Llama 3.2 90B was dubbed “the unfulfilled promise” for combining mediocre functional skill with the weakest security posture, while OpenCoder-8B was called “the rapid prototyper,” ideal for quick proofs of concept but carrying the highest overall issue density. Lastly, Claude 3.7 Sonnet was dubbed “the balanced predecessor”: its output was the most comment-friendly, aiding readability, but it shares the same core security gaps.
“Functional performance benchmarks are a vital measure of an LLM’s core problem-solving capabilities and have been a key part of documenting the industry’s rapid progress. Our findings are not intended to diminish these achievements but to enrich them with additional context and understanding,” the report’s authors write.
The authors added that without systematic security and quality reviews, organizations risk deploying AI-generated code riddled with severe bugs and vulnerabilities. A “trust but verify” approach is recommended for every line of code, regardless of whether it’s written by a human or an LLM.