A headline like "8-12x efficiency" sounds flashy until you ask the boring question: 8-12x compared to what, exactly? In practice, the number only means something when you break the work into planning, coding, verification, and human review.
An 8-12x gain usually means the agent compresses a multi-hour coding job into roughly one-eighth to one-twelfth of the time by doing the repetitive parts first and handing humans a smaller, testable diff. The important metric is end-to-end throughput: how long until the change is merged, not how many lines the agent wrote.
That's the same pattern systems research keeps pointing at. In MoE inference, the headline speedup comes from only activating a small subset of experts, not from making every operation faster [1]. In agentic kernel work, CUCo showed that the biggest gains came from structuring the search space and cutting host-side overhead, not from raw generation alone [2]. The parallel is obvious: the win comes from removing friction.
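To make the end-to-end point concrete, here is a minimal sketch that compares a coding-stage speedup with the speedup across all four stages. The stage names follow the breakdown above, but every hour figure is a made-up assumption for illustration, not a measurement from Nubank, Devin, or the cited papers.

```python
# Hypothetical per-stage hours for one migration task; none of these
# figures come from the article or from any published measurement.
human_hours = {"planning": 2.0, "coding": 16.0, "verification": 4.0, "review": 2.0}
agent_hours = {"planning": 1.0, "coding": 0.5, "verification": 0.5, "review": 1.0}


def speedup(before: dict, after: dict) -> float:
    """End-to-end speedup: total time before divided by total time after."""
    return sum(before.values()) / sum(after.values())


coding_only = human_hours["coding"] / agent_hours["coding"]
end_to_end = speedup(human_hours, agent_hours)

print(f"coding-stage speedup: {coding_only:.0f}x")  # coding-stage speedup: 32x
print(f"end-to-end speedup:   {end_to_end:.1f}x")   # end-to-end speedup:   8.0x
```

With these assumed inputs the coding stage alone looks 32x faster, while the change reaches merge only 8x sooner, because planning and review shrink far less than coding does.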
Agents get faster on huge tasks because big projects have more room for delegation, batching, and parallel verification. A human spends a lot of time context-switching, while an agent can keep grinding through narrow subproblems. The bigger the task surface, the more value you get from a workflow that can draft, test, and revise without getting tired.
That's why long-context systems matter. NPUMoE's paper shows that for long prompts, prefill dominates runtime and CPU-NPU synchronization can eat a huge share of the budget [1]. In plain English: once the task gets large enough, the overhead becomes the story. Agent workflows behave the same way. If you reduce the overhead of handoffs and make the task more structured, the apparent speedup jumps.
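One rough way to see why overhead "becomes the story" is Amdahl's law. It is a standard result brought in here for illustration, not something the article or the cited papers derive for agent workflows. Assume a fraction p of the end-to-end time can be accelerated by a factor s, while the remaining 1 - p (handoffs, waiting on review) stays fixed. The values of p and s below are hypothetical.

```latex
S = \frac{1}{(1 - p) + \dfrac{p}{s}}
\qquad
\begin{aligned}
p = 0.90,\ s = 20 &: \quad S = \frac{1}{0.10 + 0.045} \approx 6.9 \\
p = 0.95,\ s = 20 &: \quad S = \frac{1}{0.05 + 0.0475} \approx 10.3
\end{aligned}
```

Under these assumptions, halving the fixed overhead from 10% to 5% of the original time lifts the overall speedup from about 7x to about 10x without making the agent itself any faster.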
The efficiency comes from three places: less human prompting, fewer correction cycles, and better task decomposition. The agent is not just coding faster; it is absorbing the annoying middle steps that usually burn senior engineer time. That includes reading files, tracing dependencies, running tests, and producing a first pass that is already close.
CUCo is a useful analogy here because it explicitly separates correctness-first generation from performance search [2]. That design works because it avoids asking one system to do everything at once. Devin-style migration work is similar. If you ask the agent to migrate six million lines without a clear staging plan, you get churn. If you define milestones, tests, and ownership boundaries, the same agent suddenly looks much more impressive.
A real migration workflow looks less like "write code" and more like "run a controlled factory." The agent scans the codebase, proposes a local change, runs tests, inspects failures, and repeats until the diff is stable. Humans stay in the loop at the decision points, but they stop being the bottleneck for every tiny edit.
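The propose, test, inspect, repeat loop can be sketched as below. This is an illustrative outline, not Devin's actual implementation: `propose` and `run_tests` are hypothetical callables standing in for the agent and the test harness, and `max_rounds` is an assumed budget after which the unit goes back to a human.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    passed: bool
    failures: List[str]


def migrate_unit(
    propose: Callable[[List[str]], str],     # hypothetical: failures -> candidate diff
    run_tests: Callable[[str], CheckResult], # hypothetical: diff -> test outcome
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Propose, test, and revise until tests pass or the round budget runs out.

    Returns (diff, rounds_used). diff is None when the budget is exhausted,
    which is the signal to hand the unit back to a human.
    """
    failures: List[str] = []
    for round_number in range(1, max_rounds + 1):
        diff = propose(failures)
        result = run_tests(diff)
        if result.passed:
            return diff, round_number
        failures = result.failures  # feed failures into the next proposal
    return None, max_rounds


# Stub agent and test runner so the sketch runs on its own.
def fake_propose(failures: List[str]) -> str:
    return "diff-v2" if failures else "diff-v1"


def fake_run_tests(diff: str) -> CheckResult:
    if diff == "diff-v1":
        return CheckResult(passed=False, failures=["test_refund_rounding"])
    return CheckResult(passed=True, failures=[])


diff, rounds = migrate_unit(fake_propose, fake_run_tests)
print(diff, rounds)  # diff-v2 2
```

The round budget is the design choice that keeps humans at the decision points: a unit that does not converge is escalated instead of being retried indefinitely.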
Here's the shape of the work in practice:
| Stage | Human-heavy workflow | Agent-heavy workflow |
|---|---|---|
| Understand scope | Hours | Minutes |
| Find impacted files | Hours | Minutes |
| Draft first change | Hours | Minutes |
| Run and interpret tests | Hours | Minutes |
| Rework edge cases | Hours | Minutes |
| Merge-ready review | Medium | Small |
The catch is that these rough estimates only hold if the task is framed well. On messy migrations, the agent can still waste time on the wrong abstraction. That's where better prompts help. A tool like Rephrase is useful because it turns a fuzzy instruction into a tighter brief before the agent ever touches the repo.
Architecture matters because "efficient" systems are often just systems that spend less time doing useless work. In the research, MoE models can look bizarrely fast because only a subset of parameters is active per token, while dense models pay the full cost every step [1]. The lesson for agent migrations is the same: if you can narrow the active surface area, you win.
That also explains why people report wildly different results from the same agent. A well-scoped migration, a clean test harness, and a good prompt can feel like a 10x jump. A sprawling, ambiguous refactor can feel slow and fragile. The model did not change. The operating system around it did.
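One way to keep the active surface area narrow is to check each proposed diff against an explicit scope before anyone reviews it. The sketch below is an assumption about how a team might do that; the glob patterns and file paths are invented for illustration.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical scope for one migration phase; the glob patterns and paths
# are made up for illustration.
IN_SCOPE = ["services/payment/adapter/*", "tests/payment/adapter/*"]


def out_of_scope(changed_files: List[str], allowed: List[str]) -> List[str]:
    """Return the changed files that match none of the allowed glob patterns.

    Note that fnmatch-style '*' also matches '/', so nested directories
    under an allowed prefix count as in scope.
    """
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed)
    ]


changed = [
    "services/payment/adapter/client.py",
    "tests/payment/adapter/test_client.py",
    "services/payment/core/ledger.py",
]
print(out_of_scope(changed, IN_SCOPE))  # ['services/payment/core/ledger.py']
```

A non-empty result is a cheap signal that the agent has drifted outside the intended phase and the diff should be split or rejected.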
You should prompt an AI coding agent like you're briefing a contractor with access to your repo, not like you're chatting with a genius intern. State the objective, the constraints, the exact files or modules in scope, and what "done" means. If the task is large, break it into phases and ask for a plan before implementation.
A weak prompt sounds like this:
Migrate the payment service to the new API and make sure everything still works.
A better prompt sounds like this:
Migrate only the payment-service adapter layer to the new API.
Constraints:
- Do not change business logic outside the adapter.
- Preserve existing test names unless a test must be rewritten.
- Keep the diff under 400 lines if possible.
- First return a step-by-step plan, then propose the code changes.
- Add or update tests for every changed endpoint.
Definition of done:
- All unit tests pass.
- No compile errors.
- Rollback is straightforward.
That second version is what unlocks agent speed. It reduces ambiguity, which reduces retries. It also makes human review cheaper because the diff has a clear shape. If you want more prompt patterns like this, the Rephrase blog has more articles on practical prompt workflows.
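If briefs like the one above are written often, their shape can be enforced in code. The following is a minimal sketch, assuming a hypothetical `TaskBrief` helper that is not part of Devin, Rephrase, or any other tool. It refuses to render a brief that lacks constraints or a definition of done.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskBrief:
    objective: str
    constraints: List[str] = field(default_factory=list)
    done_when: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the brief as plain text, refusing an under-specified one."""
        if not self.constraints or not self.done_when:
            raise ValueError("a brief needs at least one constraint and one done-criterion")
        lines = [self.objective, "Constraints:"]
        lines += [f"- {c}" for c in self.constraints]
        lines.append("Definition of done:")
        lines += [f"- {d}" for d in self.done_when]
        return "\n".join(lines)


brief = TaskBrief(
    objective="Migrate only the payment-service adapter layer to the new API.",
    constraints=[
        "Do not change business logic outside the adapter.",
        "Keep the diff under 400 lines if possible.",
    ],
    done_when=["All unit tests pass.", "No compile errors."],
)
print(brief.render())
```

Raising on an empty section is deliberate: the weak prompt shown earlier would fail here, because it states an objective but no constraints and no definition of done.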
Teams should measure cycle time, defect rate, reviewer load, and percentage of change accepted without rewrite. If you only track "lines changed" or "tokens used," you'll fool yourself. The useful question is whether the agent makes shipping faster while keeping quality stable.
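The four measures named above can be computed from a simple record per merged change. This is a sketch under assumed field names; `MergedChange` and the sample data are hypothetical, and a real team would pull these values from its own review and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Dict, List


@dataclass
class MergedChange:
    opened: datetime
    merged: datetime
    review_minutes: float     # human time spent reviewing
    rewritten_by_human: bool  # True if a reviewer had to rewrite the diff
    caused_defect: bool       # True if a bug was later traced to this change


def summarize(changes: List[MergedChange]) -> Dict[str, float]:
    """Cycle time, defect rate, reviewer load, and acceptance without rewrite."""
    n = len(changes)
    cycle_hours = [(c.merged - c.opened).total_seconds() / 3600 for c in changes]
    return {
        "median_cycle_hours": median(cycle_hours),
        "defect_rate": sum(c.caused_defect for c in changes) / n,
        "review_minutes_per_change": sum(c.review_minutes for c in changes) / n,
        "accepted_without_rewrite": sum(not c.rewritten_by_human for c in changes) / n,
    }


# Made-up sample: four changes opened at 09:00 and merged 2, 4, 6, and 8 hours later.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    MergedChange(start, datetime(2024, 1, 1, 11, 0), 10, False, False),
    MergedChange(start, datetime(2024, 1, 1, 13, 0), 20, True, False),
    MergedChange(start, datetime(2024, 1, 1, 15, 0), 30, False, True),
    MergedChange(start, datetime(2024, 1, 1, 17, 0), 20, False, False),
]
stats = summarize(sample)
print(stats)
```

Comparing these figures for agent-assisted and human-only changes over the same period is what shows whether shipping got faster while quality held steady.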
I'd also track how much of the task became "agent-contained." In the best cases, the human only intervenes at the plan and review layers. In the worst cases, the human is back in the loop every five minutes. That difference is the real efficiency story, and it's much more important than the headline number.
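The "agent-contained" idea can be turned into a number by logging where humans step in. The sketch below assumes a hypothetical log format with three layer labels (`plan`, `mid_task`, `review`). Both the labels and the sample tasks are invented for illustration.

```python
from typing import Dict, List

# Hypothetical intervention log: task id -> layers where a human stepped in.
interventions = {
    "task-1": ["plan", "review"],
    "task-2": ["plan", "mid_task", "mid_task", "review"],
    "task-3": ["review"],
    "task-4": ["plan", "mid_task", "review"],
}

ALLOWED_LAYERS = {"plan", "review"}


def agent_contained_share(log: Dict[str, List[str]]) -> float:
    """Fraction of tasks where humans intervened only at plan or review."""
    contained = sum(1 for layers in log.values() if set(layers) <= ALLOWED_LAYERS)
    return contained / len(log)


print(agent_contained_share(interventions))  # 0.5
```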
The big takeaway is simple: 8-12x efficiency is possible, but only when the workflow is engineered for it. The model is part of the stack, not the whole stack. If you want better results from Devin, Cursor, or any coding agent, spend as much time on task design as you do on the tool itself. And if your prompts keep drifting, tools like Rephrase can tighten them in two seconds.
Documentation & Research
Community Examples
3. Long-context coding on RTX 5080 16GB: Qwen3.6-35B-A3B holds 30 t/s at 128K - r/LocalLLaMA
4. Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU - r/LocalLLaMA
It usually means the agent completes a task in roughly one-twelfth to one-eighth of the human time, not that it is 8-12x smarter. The gain comes from faster iteration, tighter scoping, and less back-and-forth.
They slow down when context gets noisy, interfaces are unclear, or the task mixes planning, code changes, and verification. Long-context workloads also hit memory and dispatch overhead.
Give the agent a narrow goal, explicit constraints, and a testable definition of done. Tools like [Rephrase](https://rephrase-it.com) can help turn vague requests into sharper prompts before you start.