GPT-5.5 sets a new AI code security record — and proves Cursor vs. Codex is the real story

Q: Was Codex weak on security?

Not exactly. Codex + GPT-5.5 reportedly still reached 20.1% security correctness, but its functional correctness score was much lower than Cursor + GPT-5.5.

Q: What is the main takeaway for teams?

The harness matters almost as much as the model. Evaluating the model alone is no longer enough.

Q: Does this mean AI-generated code is secure now?

No. Even the top-scoring setup still solved only a minority of security tasks securely.

Security & DevOps · Apr 30, 2026

@Zachas (Author, Admin)

GPT-5.5 just set a new code security benchmark high in Cursor, but the more important finding is how differently the same model performs when routed through Codex.

If you only read the headline result, the takeaway sounds simple: GPT-5.5 set a new high-water mark for AI-generated code security in the Endor Labs Agent Security League. The more useful reading is less tidy, and it matters more for anyone deploying AI coding tools in real engineering teams.

The headline score came from Cursor + GPT-5.5, which reportedly reached 23.5% security correctness and 87.2% functional correctness. That edges past the previous record of 22.9% security correctness held by Cursor + Claude Opus 4.7. But the same model, when routed through Codex, landed at 20.1% security correctness and only 61.5% functional correctness.

That is the real story: same week, same model, different harness, very different outcome.

The benchmark result everyone will quote

From a pure leaderboard perspective, the headline is legitimate. Cursor + GPT-5.5 now sits at the top of the reported security ranking. It clears the previous record and becomes one of the few tested combinations to break the 20% security barrier at all.

Here is the top of the stack as reported:

Rank	Harness	Model	Functional %	Secure %
1	Cursor	GPT-5.5	87.2	23.5
2	Cursor	Opus 4.7	91.1	22.9
3	Claude Code	Opus 4.7	87.2	20.1
4	Codex	GPT-5.5	61.5	20.1
5	Codex	GPT-5.4	62.6	17.3

For SEO terms like GPT-5.5 coding benchmark, AI code security leaderboard, or Cursor vs Codex security, that ranking is the headline answer. But it can also be misleading if read too casually.

Why the harness matters as much as the model

The strongest insight in the Endor Labs result is not that GPT-5.5 improved. It is that the agent harness appears to shape outcomes almost as strongly as the model itself.

Cursor + GPT-5.5 delivered 23.5% security correctness and 87.2% functional correctness. Codex + GPT-5.5, by contrast, held a respectable 20.1% on security, tying Claude Code + Opus 4.7, but dropped sharply to 61.5% on functional correctness.
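
To make the size of that gap concrete, here is a minimal sketch that computes the difference between two harnesses running the same model. The scores are the reported figures from the table above; the dictionary layout and the function name harness_gap are illustrative choices, not part of the Endor Labs tooling.

```python
# Reported scores (functional %, secure %) keyed by (harness, model),
# copied from the leaderboard table above.
REPORTED = {
    ("Cursor", "GPT-5.5"): (87.2, 23.5),
    ("Cursor", "Opus 4.7"): (91.1, 22.9),
    ("Claude Code", "Opus 4.7"): (87.2, 20.1),
    ("Codex", "GPT-5.5"): (61.5, 20.1),
    ("Codex", "GPT-5.4"): (62.6, 17.3),
}


def harness_gap(model, harness_a, harness_b, scores=REPORTED):
    """Return (functional, secure) differences in percentage points for one model."""
    func_a, sec_a = scores[(harness_a, model)]
    func_b, sec_b = scores[(harness_b, model)]
    # Round to one decimal place to hide floating-point noise.
    return round(func_a - func_b, 1), round(sec_a - sec_b, 1)


functional_gap, secure_gap = harness_gap("GPT-5.5", "Cursor", "Codex")
print(f"functional gap: {functional_gap} points, secure gap: {secure_gap} points")
# prints "functional gap: 25.7 points, secure gap: 3.4 points"
```

For the same model, the functional gap between harnesses is several times larger than the security gap.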

Figure (AI-generated image): The same model can look dramatically different depending on the agent harness wrapped around it.

In plain English: if your team picks the right model but the wrong execution environment, scaffolding, prompt flow, or tool orchestration layer, you may leave a lot of value on the table.

That means product comparisons framed as model vs. model are increasingly incomplete. What matters in production is the full stack:

model capability
harness architecture
repository handling
tool calling behavior
context management
patch application workflow
test execution loop

Why Codex is still interesting, despite the lower functional score

At first glance, Codex looks weaker here because the functional score is much lower. But there is a subtler point that advanced teams should notice: Codex still held security at 20.1%, tying Claude Code + Opus 4.7 on that axis.

That creates a different kind of story. Codex may be lagging in functional completion for some benchmark tasks, yet it still appears to preserve a relatively strong security signal.

Security lens: Codex + GPT-5.5 still ties one of the better non-Cursor combinations on security correctness, which suggests the harness may be surfacing security-relevant reasoning even when overall task completion lags.

Functional lens: The 61.5% functional score is hard to ignore. If your engineering workflow prioritizes passing more real tasks end to end, the practical trade-off may still favor a higher-functioning harness.

That distinction matters for readers searching "Is Codex safer than Cursor?" The benchmark does not support such a simple conclusion. What it does suggest is that Codex may expose a different balance between secure reasoning and task execution completeness.

What may be dragging Codex down

According to the Endor Labs breakdown, many of the Codex + GPT-5.5 functional misses overlap with GPT-5.4 misses in the same Codex harness. That implies at least part of the problem is not the model generation itself, but the harness environment around it.
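
The overlap argument is easy to check on your own evaluation runs. The sketch below assumes you have the set of failed task IDs for each run; the IDs shown are hypothetical placeholders, not data from the report, and miss_overlap is an illustrative helper name.

```python
def miss_overlap(misses_new, misses_old):
    """Return the shared misses and the share of the newer run's misses they cover."""
    new_set, old_set = set(misses_new), set(misses_old)
    shared = new_set & old_set
    share = len(shared) / len(new_set) if new_set else 0.0
    return shared, share


# Hypothetical task IDs, for illustration only; not taken from the report.
codex_gpt55_misses = {"task-003", "task-017", "task-042", "task-108"}
codex_gpt54_misses = {"task-003", "task-017", "task-042", "task-077"}

shared, share = miss_overlap(codex_gpt55_misses, codex_gpt54_misses)
print(sorted(shared), f"{share:.0%}")
# prints "['task-003', 'task-017', 'task-042'] 75%"
```

A high share of misses that persist across model generations in one harness points at the harness rather than the model, which is the inference the breakdown draws.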

The reported failure categories include issues like:

whole-file skeleton reconstruction
framework wiring and route integration problems
NoneType handling in validators and security helpers
cryptographic CLI integration mistakes
edge-case logic that is conceptually right but operationally incomplete

Figure (AI-generated image): Security benchmarking exposes the gap between code that works and code that is actually safe.

The standout case in the report is planet-client-python (CVE-2023-32303), where the fix should have been straightforward: ensure secret credentials are created with restrictive permissions. Yet Codex + GPT-5.5 reportedly hallucinated an argument that does not exist in pathlib.Path.open, causing the functional test to fail.

That kind of miss is revealing because it is not about deep security theory. It is about interface precision under real task pressure.
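
For context, pathlib.Path.open accepts only mode, buffering, encoding, errors, and newline, so it has no way to set permission bits. The sketch below shows one common way to create an owner-only file on POSIX systems. It is not the upstream planet-client-python patch; the function name write_secret and the file name are hypothetical.

```python
import os
import stat
import tempfile
from pathlib import Path


def write_secret(path: Path, contents: str) -> None:
    """Write contents to path so that only the owner can read or write it."""
    # Path.open() has no parameter for permission bits, so use os.open(),
    # whose third argument is the permission mode applied at creation time.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        # The creation mode is ignored when the file already exists,
        # so tighten the permissions explicitly as well (POSIX only).
        os.fchmod(handle.fileno(), 0o600)
        handle.write(contents)


with tempfile.TemporaryDirectory() as tmp:
    secret_path = Path(tmp) / "secret.json"  # hypothetical file name
    write_secret(secret_path, '{"key": "example"}')
    print(oct(stat.S_IMODE(secret_path.stat().st_mode)))
```

Setting the mode at creation time matters: writing the file first and calling chmod afterwards leaves a short window in which other users could read it.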

Why this benchmark matters beyond one leaderboard update

The bigger context comes from the broader Agent Security League and its use of the SusVibes benchmark. That benchmark evaluates agent output on 200 real-world vulnerability tasks drawn from open-source Python projects and scores both functional correctness and security correctness.

This is important because many AI coding evaluations still over-index on whether code runs, compiles, or passes a narrow unit test. Security benchmarking forces a harder question: does the generated code avoid introducing or preserving exploitable weakness?
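
A two-axis score of this kind can be sketched in a few lines. The example below is an illustration, not the SusVibes pipeline: the TaskResult type is hypothetical, and the rule that a task counts as secure only when it is also functionally correct is an assumption made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    functional_pass: bool  # the project's functional tests pass
    security_pass: bool    # the vulnerability-specific tests pass


def score(results):
    """Return (functional %, secure %) over all attempted tasks."""
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    functional = sum(r.functional_pass for r in results)
    # Assumption: a task counts as secure only if it is also functional.
    secure = sum(r.functional_pass and r.security_pass for r in results)
    return round(100 * functional / total, 1), round(100 * secure / total, 1)


# Hypothetical results for four tasks, for illustration only.
results = [
    TaskResult("a", True, True),
    TaskResult("b", True, False),
    TaskResult("c", True, False),
    TaskResult("d", False, False),
]
print(score(results))  # prints "(75.0, 25.0)"
```

Under this scoring, a run can look strong on the first number and weak on the second, which is the pattern the leaderboard shows.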

What is SusVibes?

SusVibes is an open benchmark and evaluation pipeline designed to test how AI agents handle real-world security remediation tasks. It reportedly covers 200 realistic tasks from 108 open-source projects across a wide range of CWE classes.

That is also why the absolute numbers matter. Even the new record-holder, Cursor + GPT-5.5, is still only at 23.5% security correctness. That is progress, but it is nowhere near “safe by default.”

The uncomfortable but useful conclusion

The optimistic read is that security scores are finally moving upward. The more sober read is that they are moving upward from a very low base.

A model-plus-agent combination that sets a record while still failing most security tasks should be viewed as a sign of improvement, not as a sign that review is optional.

That is the right frame for searches like "Is GPT-5.5 secure for coding?" or "Should teams trust AI-generated code in production?" The answer is: more promising than before, but still not trustworthy without independent review.

What engineering teams should actually do with this information

If you are evaluating AI coding tools for internal use, this benchmark suggests a practical decision framework.
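
One hypothetical way to turn the reported figures into a shortlist is to set a floor on functional correctness and then rank what remains by security correctness. The sketch below does that with the leaderboard numbers from this article. The 80% floor, the Combo type, and the shortlist function are assumptions for illustration, not a framework taken from the Endor Labs report.

```python
from typing import NamedTuple


class Combo(NamedTuple):
    harness: str
    model: str
    functional: float  # functional correctness, %
    secure: float      # security correctness, %


# Figures as reported in the leaderboard table earlier in this article.
LEADERBOARD = [
    Combo("Cursor", "GPT-5.5", 87.2, 23.5),
    Combo("Cursor", "Opus 4.7", 91.1, 22.9),
    Combo("Claude Code", "Opus 4.7", 87.2, 20.1),
    Combo("Codex", "GPT-5.5", 61.5, 20.1),
    Combo("Codex", "GPT-5.4", 62.6, 17.3),
]


def shortlist(combos, min_functional):
    """Drop combos under the functional floor, then rank by security, then functional."""
    eligible = [c for c in combos if c.functional >= min_functional]
    return sorted(eligible, key=lambda c: (c.secure, c.functional), reverse=True)


# The 80% floor is a hypothetical threshold; choose one that fits your workflow.
for combo in shortlist(LEADERBOARD, min_functional=80.0):
    print(f"{combo.harness} + {combo.model}: {combo.secure}% secure, {combo.functional}% functional")
```

Whatever the shortlist says, the security column tops out at 23.5%, so independent review stays in the loop for every option on it.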

You should also assume that agent architecture is now a first-class product differentiator. The future competition is not just OpenAI vs Anthropic vs Google. It is also Cursor vs Codex vs Claude Code vs every other orchestration layer that sits between the model and your codebase.

Final verdict

GPT-5.5 deserves the headline for setting a new security record in Cursor. But the more durable insight is that the harness is no longer a side detail. It is part of the performance result.

For developers, CTOs, and security leads, that changes the buying question. The winner is not simply the model with the strongest raw intelligence. The winner is the end-to-end coding system that best converts that intelligence into working, secure, reviewable code.

FAQ

Did GPT-5.5 set a new AI code security benchmark record?

Yes. In the reported Endor Labs benchmark, Cursor + GPT-5.5 reached 23.5% security correctness, slightly above the previous 22.9% record.

Sources & links

References, demos, and supporting links.

Endor Labs report on GPT-5.5 in the Agent Security League (endorlabs.com)
Primary Endor Labs methodology and whitepaper overview (endorlabs.com)
Previous Endor Labs leaderboard record report for Claude Opus 4.7 (endorlabs.com)
SusVibes benchmark repository (github.com)
