We’re all using AI assistants to write code faster. But a crucial question remains: are we writing safer code?
A comprehensive study in Empirical Software Engineering offers some sobering answers. Researchers put nine state-of-the-art LLMs (from OpenAI, Google, Meta, etc.) to the test by having them generate over 330,000 C programs. They then used formal verification—a rigorous mathematical method—to analyse the security of the resulting code.
The findings are a critical reality check for how we should approach AI in our workflows.
Here are the key takeaways:
- Over 62% of the AI-generated programs were vulnerable. This wasn’t a close call; the majority of code produced with simple, neutral prompts had security flaws.
- No single model was a clear winner. While some models performed slightly better, all of them introduced vulnerabilities at “unacceptable rates”. This suggests a systemic issue with current LLMs, not a flaw in just one.
- The bugs were familiar classics. The most common vulnerabilities were NULL pointer dereferences and buffer overflows, with scanf a major culprit (see the sketch after this list). LLMs are excellent at replicating common coding patterns from their training data—including decades-old bad habits.
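To make those two classes concrete, here is a minimal C sketch—illustrative only, not code from the study—showing the unbounded scanf pattern behind many buffer overflows and the unchecked malloc result behind many NULL pointer dereferences, alongside the small fixes a reviewer would ask for:

```c
/* Illustrative sketch of the two vulnerability classes named above;
 * not code taken from the paper. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* 1. Buffer overflow: an unbounded %s lets input longer than the
     *    buffer overwrite adjacent memory. */
    char name[16];
    /* scanf("%s", name); */           /* vulnerable: no length limit        */
    scanf("%15s", name);               /* safer: width leaves room for '\0'  */

    /* 2. NULL pointer dereference: malloc can fail, and using the result
     *    without checking it crashes the program (or worse). */
    char *copy = malloc(strlen(name) + 1);
    if (copy == NULL) {                /* the check that is easy to omit     */
        return 1;
    }
    strcpy(copy, name);

    printf("hello, %s\n", copy);
    free(copy);
    return 0;
}
```

Neither fix is exotic: a width specifier and a NULL check. That is exactly the kind of detail a human reviewer catches and a pattern-matching model happily leaves out.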
As a software engineer, I draw a clear conclusion from this: treat AI-generated code with scepticism. It is a powerful tool for accelerating development, but it is not a substitute for expertise and rigorous validation.
These models are pattern-matchers, not security architects. They’ve learned from a vast ocean of public code, which unfortunately includes a lot of insecure examples. This reinforces that our role as engineers is shifting—we must become even better reviewers and validators.
“Trust, but verify” is more important than ever.