Devin 3 at 90% SWE-bench

Learn how Devin 3 pushed SWE-bench Verified to 90% by combining training data, verification, and tighter task design.

Ilia Ilinskii
Rephrase · May 30, 2026

Tools · 8 min read

Devin 3 hitting 90% on SWE-bench Verified is a big deal, but not because a number went up. It matters because it hints that Cognition may have narrowed the old gap between a flashy demo and a system that can actually work through real software tasks.

Key Takeaways

The real breakthrough is not "more intelligence" alone; it's better agent design plus better training signals.
SWE-bench-style gains usually come from repository exploration, patch quality, and verification, not from one perfect prompt.
Benchmarks are useful, but they can overstate progress when contamination or overfitting creeps in.
If you build with coding agents, the lesson is simple: make them inspect, edit, and verify in tight loops.
Prompting still matters. Tools like Rephrase can turn a loose request into a sharper, more testable coding prompt.

Why does 90% SWE-bench Verified matter?

A 90% score on SWE-bench Verified signals that the system can repeatedly solve real repository issues, not just answer coding trivia. The headline matters most as proof of execution: finding the right files, making the right edit, and passing the tests. That is exactly where many "impressive demos" used to fall apart [1][2].

The benchmark itself has become a forcing function for software-agent quality. Research on coding agents keeps showing that broad task success depends on repository exploration, non-trivial reasoning, and patch-like outputs, not just generating code in isolation [2]. That's why this milestone feels different from a neat demo video.

How did Cognition close the "impressive demo" gap?

The gap closes when a team stops optimizing for a polished moment and starts optimizing for repeatable behavior. The winning recipe is usually: stronger training data, better task shaping, tighter verification, and an agent loop that keeps the model grounded in the repo instead of drifting into plausible-sounding guesses [2][3].

That's the key shift. A demo can survive on one lucky trajectory. A product needs consistency across many trajectories. OpenAI's recent warning about SWE-bench Verified contamination also matters here: if the benchmark is noisy or leaked, the score alone can't tell you whether the agent truly learned software engineering or just learned the test [1].

What do the research papers say about coding agents?

Research says coding agents improve when training includes the same skills they need at runtime: exploration, localization, editing, and verification. HYBRID-GYM is a good example. It shows that synthetic training tasks can transfer to real-world benchmarks when they preserve the structure of repo-level work [2].

It also shows something practical that I think a lot of people miss: script-level tasks are not enough. Repo-level generalization depends on the agent learning to navigate a codebase, not just produce code snippets. That lines up with the "impressive demo" problem perfectly. A system can look smart in a narrow sandbox and still fail once it has to operate inside a messy repository [2].

What changed between a demo and a product?

The difference is verification. In a demo, the model can sound right. In a product, the model has to be right after edits, test runs, and context switches. That's why verification-heavy methods keep winning in practice, including agent loops that force checking after each edit rather than waiting until the end [3].
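
To make "check after each edit" concrete, here is a minimal Python sketch of that kind of loop. It is an illustration under stated assumptions, not Cognition's implementation: the in-memory `Repo` dict, the `Edit` type, `run_verified_edits`, and the `check_add` stand-in for a targeted test run are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical in-memory repo for illustration: file path -> file contents.
Repo = Dict[str, str]


@dataclass
class Edit:
    path: str
    new_content: str


@dataclass
class LoopResult:
    repo: Repo
    applied: List[Edit] = field(default_factory=list)
    rejected: List[Edit] = field(default_factory=list)


def run_verified_edits(
    repo: Repo, edits: List[Edit], check: Callable[[Repo], bool]
) -> LoopResult:
    """Apply edits one at a time and keep an edit only if the check passes."""
    current = dict(repo)  # work on a copy so the caller's repo is untouched
    result = LoopResult(repo=current)
    for edit in edits:
        candidate = dict(current)
        candidate[edit.path] = edit.new_content
        if check(candidate):
            current = candidate  # verified: this becomes the new baseline
            result.applied.append(edit)
        else:
            result.rejected.append(edit)  # unverified edits never land
    result.repo = current
    return result


def check_add(repo: Repo) -> bool:
    """Stand-in for a targeted test run: load calc.py and test add()."""
    namespace: dict = {}
    try:
        exec(repo["calc.py"], namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


if __name__ == "__main__":
    repo = {"calc.py": "def add(a, b):\n    return a - b\n"}
    edits = [
        Edit("calc.py", "def add(a, b):\n    return a * b\n"),  # plausible, wrong
        Edit("calc.py", "def add(a, b):\n    return a + b\n"),  # actual fix
    ]
    outcome = run_verified_edits(repo, edits, check_add)
    print(len(outcome.applied), len(outcome.rejected))  # prints "1 1"
```

In a real agent the check would be a test command run against a working tree, and a rejected edit would feed back into the next attempt. The point of the sketch is only the shape: no edit becomes the baseline until something has verified it.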

Here's the short version: if you want fewer hallucinated fixes, you need fewer unverified steps. That's also why prompt quality matters so much. A vague instruction produces vague behavior. A structured instruction produces inspectable behavior. If you want to speed that up, Rephrase can rewrite your rough request into a cleaner coding prompt in seconds.

Which part of the stack actually moves the score?

Most gains come from the stack working together, not from a single model upgrade. Better base models help, but the bigger lifts usually come from data curation, agent scaffolds, repo-aware training, and verification loops. HYBRID-GYM's results are a strong reminder that the "surrounding system" can add more than a raw model bump [2].

Lever	What it changes	Why it helps
Training data	Teaches repo-level behavior	The agent learns how software work actually unfolds
Agent scaffold	Controls tool use and exploration	The model stays grounded in files, tests, and diffs
Verification loop	Checks edits early and often	Reduces confident but wrong patches
Benchmark hygiene	Prevents misleading scores	Makes the result more trustworthy

That table is the real story behind the headline. Scores rise when you stop treating code generation like a chat problem and start treating it like an engineering workflow.

What should builders learn from Devin 3?

Builders should stop asking, "Can the model code?" and start asking, "Can the model search, edit, test, and recover?" That shift matters more than benchmark theater. It also means your prompts should ask for concrete artifacts: file paths, diffs, test plans, and verification steps [2][3].

A better prompt usually looks like this:

Fix the failing test in this repo.

First inspect the relevant files and identify the root cause.
Then propose the smallest patch that addresses the bug.
After editing, run the most targeted test you can.
Report:
1) root cause
2) files changed
3) test evidence
4) any remaining risk

That style pushes the model toward behavior that can be checked. And if you're writing these prompts often, a tool like Rephrase can save you the mechanical work of tightening the wording.
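
If you send prompts like the one above from code, a small helper can keep the structure consistent and flag replies that skip a requested section. This is a rough sketch, not a feature of any particular tool: `build_fix_prompt`, `missing_sections`, and `REPORT_SECTIONS` are hypothetical names, and the substring check is deliberately naive.

```python
from typing import List

# The four report items from the example prompt above.
REPORT_SECTIONS = ["root cause", "files changed", "test evidence", "any remaining risk"]


def build_fix_prompt(task: str, report_sections: List[str] = REPORT_SECTIONS) -> str:
    """Assemble an inspect -> patch -> test -> report prompt for a coding agent."""
    lines = [
        task.strip(),
        "",
        "First inspect the relevant files and identify the root cause.",
        "Then propose the smallest patch that addresses the bug.",
        "After editing, run the most targeted test you can.",
        "Report:",
    ]
    lines += [f"{i}) {name}" for i, name in enumerate(report_sections, start=1)]
    return "\n".join(lines)


def missing_sections(report: str, report_sections: List[str] = REPORT_SECTIONS) -> List[str]:
    """Return requested report sections that the agent's reply never mentions."""
    lowered = report.lower()
    return [name for name in report_sections if name not in lowered]


if __name__ == "__main__":
    print(build_fix_prompt("Fix the failing test in this repo."))
    reply = "Root cause: off-by-one in the loop. Files changed: calc.py"
    print(missing_sections(reply))  # ['test evidence', 'any remaining risk']
```

A reply with missing sections is a cheap signal to ask again before trusting the patch, which is the same inspect-and-verify habit applied to the agent's report rather than its code.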

Are SWE-bench scores still the right north star?

They're useful, but not sufficient. OpenAI's note about SWE-bench Verified contamination is the warning label here: a benchmark can become less informative as models and training pipelines saturate it [1]. So the right move is not to worship the score. It's to use it alongside cleaner task suites, long-horizon evals, and real product metrics.

What I'd watch next is whether Devin-like systems keep their edge when the task gets less benchmark-shaped and more chaotic. That's where the "demo gap" really gets tested.

Final thought

The interesting part of Devin 3 is not the number. It's the implication: the best coding agents are becoming less like scripted demos and more like disciplined junior engineers. They don't just generate code. They search, verify, and recover.

If you want to build prompts that push models in that direction, keep them concrete, testable, and repo-aware. That's the kind of prompting Rephrase is built to make easier.

References

Documentation & Research

[1] Why we no longer evaluate SWE-bench Verified - OpenAI Blog
[2] HYBRID-GYM: Training Coding Agents to Generalize Across Tasks - arXiv

Community Examples

[3] Qwen3.5-35B-A3B hits 37.8% on SWE-bench Verified Hard - r/LocalLLaMA

Frequently asked

What does 90% SWE-bench Verified mean?

It means a coding agent solved 90% of the benchmark's tasks under the benchmark's evaluation rules. That's a strong sign the system can do real repository-level debugging, not just generate plausible code.
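
As a quick worked example of the arithmetic, the sketch below computes a resolve rate in Python. It assumes the 500-task size of SWE-bench Verified, a figure that comes from the benchmark itself rather than from this post, and `resolve_rate` is a hypothetical helper name.

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Fraction of benchmark tasks whose patch passed the evaluation."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return resolved / total


# Assumption: SWE-bench Verified contains 500 tasks.
VERIFIED_TASKS = 500

if __name__ == "__main__":
    print(f"{resolve_rate(450, VERIFIED_TASKS):.0%}")  # prints "90%"
```

Under that assumption, 90% corresponds to 450 resolved tasks and 50 unresolved ones.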

What likely helped Devin 3 improve so much?

The big unlock is usually not one trick. It's a mix of better training data, stronger repo exploration, tighter verification loops, and task formats that reward real edits instead of polished demos.

How can I get better results from coding agents?

Give the agent repo context, require verification after edits, and ask for patch-level outputs. Tools like [Rephrase](https://rephrase-it.com) can help you turn vague instructions into prompts that actually steer the agent.