

A new report today from code quality testing startup SonarSource SA warns that while the latest large language models may be getting better at passing coding benchmarks, they are also introducing more severe bugs and security vulnerabilities.
The study examined more than 4,400 Java programming tasks completed by Anthropic’s Claude Sonnet 4 and Claude 3.7 Sonnet, OpenAI’s GPT-4o, Meta’s Llama 3.2 90B and the open-source OpenCoder-8B, using the SonarQube Enterprise static analysis engine.
All the models tested demonstrated strong coding skills, such as producing syntactically correct, functional code and solving complex algorithmic problems, but the analysis also found systemic weaknesses across the board. The most alarming finding was a lack of security awareness, with every model generating high proportions of “BLOCKER”-level vulnerabilities, the most severe rating.
Llama 3.2 90B topped the list, with more than 70% of its vulnerabilities rated BLOCKER, followed by GPT-4o at 62.5% and Claude Sonnet 4 at nearly 60%. The code generated by the models shared common flaws, including path traversal, injection risks and hard-coded credentials, stemming from limitations in tracking untrusted data flows and from the replication of insecure code found in training sets.
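The report doesn’t reproduce code samples, but the flaw classes it names are well understood. A minimal, hypothetical Java sketch (class, method and constant names invented for illustration) shows a hard-coded credential and a path-traversal bug of the sort rated BLOCKER, alongside the guarded version a static analyzer would expect:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class UploadReader {
    // Hard-coded credential: a classic BLOCKER-level finding on its own.
    private static final String DB_PASSWORD = "s3cret";

    // Vulnerable: untrusted input flows straight into the filesystem path,
    // so a name like "../../etc/passwd" escapes the uploads directory.
    static byte[] readUnsafe(String userSuppliedName) throws IOException {
        return Files.readAllBytes(Path.of("/srv/uploads", userSuppliedName));
    }

    // Safer: resolve and normalize the path, then verify the result is
    // still inside the intended base directory before touching the file.
    static byte[] readSafe(String userSuppliedName) throws IOException {
        Path base = Path.of("/srv/uploads").toAbsolutePath().normalize();
        Path resolved = base.resolve(userSuppliedName).normalize();
        if (!resolved.startsWith(base)) {
            throw new SecurityException("Path traversal attempt blocked");
        }
        return Files.readAllBytes(resolved);
    }
}
```

Following that untrusted input from the request all the way to the filesystem call is exactly the kind of data-flow tracking the report says the models struggle to internalize.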
The report also highlights bug severity: Claude Sonnet 4, the highest-scoring model on functional benchmarks, produced nearly double the proportion of BLOCKER bugs of its predecessor, Claude 3.7 Sonnet, a whopping 93% increase.
Many of the high-impact bugs involved concurrency issues, resource leaks and application programming interface contract violations, the types of problems that can cause unpredictable failures in production systems.
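To make the resource-leak category concrete, here is a short, hypothetical Java fragment (not taken from the report) showing the leaky pattern and the try-with-resources fix that static analyzers check for:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FirstLine {
    // Leaky: the reader is never closed, so every call abandons a file
    // handle. Invisible in a quick test, a crash under production load.
    static String readLeaky(Path file) throws IOException {
        BufferedReader reader = Files.newBufferedReader(file);
        return reader.readLine();
    }

    // Fixed: try-with-resources guarantees the handle is released even
    // if readLine() throws, closing the leak.
    static String readSafe(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            return reader.readLine();
        }
    }
}
```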
GPT-4o’s most common defects were control-flow mistakes, which made up almost half of its bug count, while OpenCoder-8B left behind significant amounts of redundant, unused code that can accumulate into long-term technical debt.
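For illustration (again, these fragments are hypothetical, not drawn from the report), the two defect patterns might look like this in Java: a control-flow bug where an early return short-circuits a loop, and dead code that no caller ever reaches:

```java
public class DefectPatterns {
    // Control-flow mistake: the early "return false" exits on the first
    // non-matching element, so only values[0] is ever really checked.
    static boolean containsBuggy(int[] values, int target) {
        for (int v : values) {
            if (v == target) {
                return true;
            }
            return false; // bug: should continue to the next iteration
        }
        return false;
    }

    // Redundant, unused code: a private helper no one calls, the kind of
    // leftover that quietly accumulates into technical debt.
    private static int unusedHelper(int a, int b) {
        return a + b;
    }
}
```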
Sonar’s research also mapped out “coding personalities” for each model.
Claude Sonnet 4 was dubbed “the senior architect”: verbose and complex, capable of building sophisticated solutions but prone to fragile, high-risk errors. GPT-4o, “the efficient generalist,” offers balanced complexity but fumbles on logical precision. Llama 3.2 90B was dubbed “the unfulfilled promise” for combining mediocre functional skill with the weakest security posture, while OpenCoder-8B was called “the rapid prototyper,” ideal for quick proofs of concept but carrying the highest overall issue density. Lastly, Claude 3.7 Sonnet was dubbed “the balanced predecessor”: its output was the most comment-friendly, aiding readability, but it shares the same core security gaps.
“Functional performance benchmarks are a vital measure of an LLM’s core problem-solving capabilities and have been a key part of documenting the industry’s rapid progress. Our findings are not intended to diminish these achievements but to enrich them with additional context and understanding,” the report’s authors write.
The authors added that without systematic security and quality reviews, organizations risk deploying AI-generated code riddled with severe bugs and vulnerabilities. A “trust but verify” approach is recommended for every line of code, regardless of whether it’s written by a human or an LLM.