We’re all using AI assistants to write code faster. But a crucial question remains: are we writing safer code?
A comprehensive study in Empirical Software Engineering offers some sobering answers. Researchers put nine state-of-the-art LLMs (from OpenAI, Google, Meta, etc.) to the test by having them generate over 330,000 C programs. They then used formal verification—a rigorous mathematical method—to analyse the security of the resulting code.
The findings are a critical reality check for how we should approach AI in our workflows.
Here are the key takeaways:
- Over 62% of the AI-generated programs were vulnerable. This wasn’t a close call; the majority of code produced with simple, neutral prompts had security flaws.
- No single model was a clear winner. While some models performed slightly better, all of them introduced vulnerabilities at “unacceptable rates”. This suggests a systemic issue with current LLMs, not a flaw in just one.
- The bugs were familiar classics. The most common vulnerabilities were NULL pointer dereferences and buffer overflows, with scanf a major culprit (see the sketch after this list). LLMs are excellent at replicating common coding patterns from their training data—including decades-old bad habits.
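To make those two classes concrete, here is a minimal C sketch—illustrative only, not code from the study—showing the unbounded scanf pattern behind many buffer overflows and the unchecked malloc result behind many NULL pointer dereferences, alongside the small fixes a reviewer would ask for:

```c
/* Illustrative sketch of the two vulnerability classes named above;
 * not code taken from the paper. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* 1. Buffer overflow: an unbounded %s lets input longer than the
     *    buffer overwrite adjacent memory. */
    char name[16];
    /* scanf("%s", name); */           /* vulnerable: no length limit        */
    scanf("%15s", name);               /* safer: width leaves room for '\0'  */

    /* 2. NULL pointer dereference: malloc can fail, and using the result
     *    without checking it crashes the program (or worse). */
    char *copy = malloc(strlen(name) + 1);
    if (copy == NULL) {                /* the check that is easy to omit     */
        return 1;
    }
    strcpy(copy, name);

    printf("hello, %s\n", copy);
    free(copy);
    return 0;
}
```

Neither fix is exotic: a width specifier and a NULL check. That is exactly the kind of detail a human reviewer catches and a pattern-matching model happily leaves out.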
As a software engineer, I draw a clear conclusion from this: treat AI-generated code with scepticism. It is a powerful tool for accelerating development, but it is not a substitute for expertise and rigorous validation.
These models are pattern-matchers, not security architects. They’ve learned from a vast ocean of public code, which unfortunately includes a lot of insecure examples. This reinforces that our role as engineers is shifting—we must become even better reviewers and validators.
“Trust, but verify” is more important than ever.