
Why GPT-5.5 Codex Uses Fewer Tokens

Discover why GPT-5.5 Codex token efficiency matters more than raw pricing, and how fewer tokens can lower real coding costs.

Ilia Ilinskii
Rephrase · May 25, 2026

Prompt engineering · 7 min read

The easy headline is that GPT-5.5 costs more per token. The real headline is harder to notice, but way more important: it appears to get the same coding work done with fewer tokens.

Key Takeaways

GPT-5.5 Codex looks more interesting on a cost-per-task basis than a cost-per-token basis.
In agentic coding, input tokens usually dominate total spend because the model keeps re-reading context. [2]
Token efficiency is often a model behavior issue, not just a prompt issue or task difficulty issue. [2]
Recent research suggests Codex-style systems can keep token use lower while preserving strong performance. [3]
Better prompts reduce waste by shrinking exploration, repeated file access, and unnecessary loops.

Why is "fewer tokens for the same task" the real story?

The real story is that developers do not buy tokens in the abstract; we buy completed work. If GPT-5.5 Codex can solve the same issue with fewer total tokens, then the relevant metric is cost per resolved task, not the sticker price on one million tokens. [2][3]
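To make the cost-per-task framing concrete, here's a rough sketch of the arithmetic. Every number in it is made up for illustration: the model names, prices, token counts, and success rates are assumptions, not measurements of GPT-5.5 or any other model.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical pricing and usage figures for one model (not real data)."""
    name: str
    price_per_million_tokens: float  # blended USD rate, assumed
    avg_tokens_per_attempt: int      # total tokens per task attempt, assumed
    success_rate: float              # fraction of attempts that resolve the task


def cost_per_resolved_task(m: ModelProfile) -> float:
    """Expected spend per resolved task: attempt cost divided by success rate."""
    if not 0 < m.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = m.avg_tokens_per_attempt / 1_000_000 * m.price_per_million_tokens
    return cost_per_attempt / m.success_rate


# Illustrative numbers only: the pricier model uses fewer tokens per attempt.
pricier = ModelProfile("pricier-but-leaner", 10.0, 400_000, 0.80)
cheaper = ModelProfile("cheaper-but-hungrier", 6.0, 900_000, 0.75)

for m in (pricier, cheaper):
    print(f"{m.name}: ${cost_per_resolved_task(m):.2f} per resolved task")
```

With these invented figures, the model with the higher sticker price lands at $5.00 per resolved task and the cheaper-per-token one at $7.20. Swap in your own measured numbers before drawing any conclusion.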

That distinction matters because agentic coding is brutally expensive compared with normal chat. In one recent paper, agentic coding consumed roughly 1000x to 3500x more tokens than simpler code chat or reasoning tasks, and most of that came from input tokens, not output tokens. The agent keeps reading files, re-reading history, calling tools, and dragging old context forward. [2]
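A toy model shows why input tokens pile up so quickly. The sketch below assumes the simplest possible agent loop: no prompt caching, no truncation, and the full history re-sent on every turn. The function name and all parameter values are hypothetical, and this is not how the paper measured anything.

```python
def simulate_agent_tokens(turns: int, base_context: int,
                          output_per_turn: int,
                          tool_result_per_turn: int) -> tuple[int, int]:
    """Return (total_input_tokens, total_output_tokens) for a toy agent loop.

    Assumes every turn re-sends the entire history (no caching, no truncation).
    """
    history = base_context
    total_in = 0
    total_out = 0
    for _ in range(turns):
        total_in += history              # the model re-reads everything so far
        total_out += output_per_turn     # then writes a short response
        # The response and the tool result both join the history for next turn.
        history += output_per_turn + tool_result_per_turn
    return total_in, total_out


total_in, total_out = simulate_agent_tokens(
    turns=30, base_context=5_000, output_per_turn=300, tool_result_per_turn=1_500
)
print(f"input: {total_in:,}  output: {total_out:,}")
```

With these assumed values the loop spends 933,000 input tokens against 9,000 output tokens, a ratio of more than 100 to 1. Real agents cache and compress context, so treat this as intuition for the shape of the curve, not a cost estimate.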

So when OpenAI says GPT-5.5 completes the same Codex tasks with fewer tokens, that is not a minor optimization. It hits the biggest lever in the whole system.

Here's what I noticed: people still talk about model pricing like it's 2023. They compare output token rates and stop there. But Codex-style workflows are not one-shot prompts. They are long trajectories. Once you think in trajectories, token efficiency becomes the main event.

What makes token efficiency so important in Codex workflows?

Token efficiency matters in Codex because long-horizon coding agents spend money through repetition, context growth, and exploration. A model that reads less, loops less, and reaches the answer sooner can beat a cheaper-per-token model on total spend and often on speed-to-result as well. [2][3]

The strongest support for this comes from research on agentic coding behavior. The Stanford/Michigan paper on token consumption found that input tokens dominate cost, and that higher token usage does not reliably improve accuracy. In fact, performance often peaks at intermediate cost, then flattens or gets worse as the model burns tokens on redundant work. [2]

That paper also found something especially useful for teams choosing models: token efficiency seems to be an inherent model behavior, not just a side effect of task difficulty. Even when compared only on tasks that every model solved, or only on tasks that every model failed, some models stayed consistently cheaper than others. [2]

That's the key lens for GPT-5.5 Codex. If it is genuinely more efficient on the same workflow, then the improvement is structural, not cosmetic.

What do the latest sources suggest about GPT-5.5 Codex efficiency?

The available evidence suggests GPT-5.5 Codex is notable because it pairs strong task performance with lower total token usage. An OpenSkillEval study reports that the Codex framework consistently shows the lowest token consumption across tasks, and specifically notes GPT-5.5 using fewer tokens than GPT-5.2 while maintaining strong results. [3]

One caveat is worth stating. I could not find a full official GPT-5.5 launch document from OpenAI. The closest official OpenAI source available was the GPT-5.3 Codex announcement, which confirms Codex is designed for long-horizon, real-world technical work. [1] The more direct GPT-5.5 token-efficiency claim appears in secondary reporting and in the OpenSkillEval paper rather than in an official GPT-5.5 system card. So I'm grounding the hard claims in research first.

Still, the pattern is pretty consistent across sources:

| Signal | What it suggests | Source |
| --- | --- | --- |
| Agentic coding costs are dominated by input tokens | Cutting total tokens meaningfully lowers real task cost | [2] |
| Token-heavy runs often reflect redundant file views and edits | More tokens do not automatically mean better coding | [2] |
| Codex framework shows lowest token consumption across tasks | Codex is optimized for efficiency, not just benchmark score | [3] |
| GPT-5.5 uses fewer tokens than GPT-5.2 in the same framework | Efficiency improved within the same model family | [3] |

That last line is the one I'd watch most closely. Same family. Same broad setup. Fewer tokens. That is exactly the kind of improvement that changes budgets.

How can you prompt Codex to preserve that token efficiency?

You preserve token efficiency by reducing open-ended wandering. The best Codex prompts constrain search space, define success clearly, and stop the model from treating every task like a full repo archaeology project.

Here's a simple before-and-after.

Before → after prompt example

Before

Fix the bug in this repo and make sure everything works.

After

Investigate the failing test related to user session expiry.

Goal:
- Find the root cause
- Modify only the minimum files needed
- Explain the fix briefly
- Run only the relevant tests first, then broader tests only if needed

Constraints:
- Avoid unrelated refactors
- Avoid re-reading large files unless necessary
- Stop once the bug is fixed and the target tests pass

Output:
1. Root cause
2. Files changed
3. Patch summary
4. Tests run

The second prompt does three useful things. It narrows the search, reduces unnecessary file churn, and creates a stopping condition. That matters because research shows expensive failures often come from repeated file views and repeated edits, not productive reasoning. [2]
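If you send prompts like the "After" version often, it can help to generate them from a small template so the goal, constraints, and output sections never get dropped. Here's a minimal sketch; `build_task_prompt` is a hypothetical helper of my own, not part of any Codex or OpenAI API.

```python
def build_task_prompt(task: str, goals: list[str],
                      constraints: list[str], outputs: list[str]) -> str:
    """Assemble a scoped prompt with Goal, Constraints, and Output sections."""
    lines = [task.strip(), "", "Goal:"]
    lines += [f"- {g}" for g in goals]
    lines += ["", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", "Output:"]
    # Number the expected output items so the reply is easy to scan.
    lines += [f"{i}. {o}" for i, o in enumerate(outputs, start=1)]
    return "\n".join(lines)


prompt = build_task_prompt(
    task="Investigate the failing test related to user session expiry.",
    goals=["Find the root cause", "Modify only the minimum files needed"],
    constraints=["Avoid unrelated refactors",
                 "Stop once the bug is fixed and the target tests pass"],
    outputs=["Root cause", "Files changed", "Patch summary", "Tests run"],
)
print(prompt)
```

The design choice here is that the stopping condition is a required part of the constraints list, because a prompt without one gives the agent no reason to quit exploring.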

This is also where tools like Rephrase help in practice. If you're sending rough prompts into Codex from your IDE, browser, or Slack, tightening the structure before execution can reduce waste without slowing you down.

Why do some models waste tokens on the same task?

Some models waste tokens because they explore inefficiently, repeat actions, and fail to stop early when progress stalls. Research shows token-hungry models often perform more repeated file views and modifications, while more efficient models achieve better cost-performance by acting with more discipline. [2]
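You can check for this pattern in your own runs by counting how often the agent re-opens the same file. The sketch below assumes a trajectory log where each step is a dict with an `action` and a `path`; that format and the `view_file` action name are hypothetical, so adapt them to whatever your agent actually records.

```python
from collections import Counter


def repeated_file_views(trajectory: list[dict]) -> dict[str, int]:
    """Count redundant views per file (every view after the first one)."""
    views = Counter(
        step["path"]
        for step in trajectory
        if step.get("action") == "view_file" and "path" in step
    )
    return {path: n - 1 for path, n in views.items() if n > 1}


# Hypothetical trajectory log, for illustration only.
log = [
    {"action": "view_file", "path": "auth/session.py"},
    {"action": "run_tests", "target": "tests/test_session.py"},
    {"action": "view_file", "path": "auth/session.py"},
    {"action": "view_file", "path": "auth/tokens.py"},
    {"action": "edit_file", "path": "auth/session.py"},
    {"action": "view_file", "path": "auth/session.py"},
]
print(repeated_file_views(log))
```

A re-read after an edit can be legitimate, so a nonzero count is a prompt to look closer, not proof of waste.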

That matches what practitioners keep reporting in the wild. One community benchmark on real engineering reasoning favored models that were not always the absolute cheapest or highest-scoring, but were fast, low-verbosity, and token-efficient enough for continuous daily use. [4] I wouldn't treat that as proof, but it lines up with the research surprisingly well.

The broader lesson is simple: verbosity is not capability. In coding agents, waste often looks smart right up until you get the invoice.

How should teams evaluate GPT-5.5 Codex now?

Teams should evaluate GPT-5.5 Codex on cost per completed workflow, not price per token. The right test is whether it resolves your real tasks with fewer steps, fewer retries, and fewer total tokens while preserving quality. That is the metric that actually affects engineering ROI. [2][3]

If I were testing it this week, I'd run a small benchmark on three buckets: easy bug fixes, medium refactors, and one messy long-horizon task. Then I'd compare four things: success rate, total tokens, wall-clock time, and human cleanup after the run.
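Here's a minimal sketch of how that scorecard could be tallied, grouped by bucket. The `Run` record and its field names are assumptions for illustration; fill them from whatever your harness logs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One benchmark run (hypothetical record shape)."""
    bucket: str             # e.g. "easy", "medium", "long-horizon"
    success: bool
    total_tokens: int
    wall_clock_s: float
    cleanup_minutes: float  # human time spent fixing the result afterwards


def scorecard(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Average the four metrics per bucket."""
    by_bucket: dict[str, list[Run]] = {}
    for r in runs:
        by_bucket.setdefault(r.bucket, []).append(r)
    return {
        bucket: {
            "success_rate": mean(1.0 if r.success else 0.0 for r in rs),
            "avg_total_tokens": mean(r.total_tokens for r in rs),
            "avg_wall_clock_s": mean(r.wall_clock_s for r in rs),
            "avg_cleanup_minutes": mean(r.cleanup_minutes for r in rs),
        }
        for bucket, rs in by_bucket.items()
    }


# Made-up runs, for illustration only.
runs = [
    Run("easy", True, 100_000, 60.0, 0.0),
    Run("easy", False, 300_000, 120.0, 10.0),
    Run("medium", True, 500_000, 400.0, 5.0),
]
for bucket, metrics in scorecard(runs).items():
    print(bucket, metrics)
```

Run the same task set against each model you're comparing and put the scorecards side by side. Averages hide outliers, so it's worth eyeballing the most expensive individual runs too.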

That's the honest scorecard.

And if you want cleaner prompt inputs for that test, Rephrase's prompt rewriting app is useful because it standardizes messy requests before they hit the model. You can also browse more prompt engineering articles on the Rephrase blog if you're trying to build a repeatable prompting workflow around coding agents.

The catch is that no model stays efficient automatically. Good prompts still matter. But if GPT-5.5 Codex is truly finishing the same work with fewer tokens, that's not a pricing footnote. That's the product story.

References

Documentation & Research

[1] Introducing GPT-5.3-Codex - OpenAI Blog
[2] How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks - arXiv
[3] OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents - arXiv

Community Examples

[4] I evaluated LLaMA and 100+ LLMs on real engineering reasoning for Python - r/LocalLLaMA

Frequently asked

Why does token efficiency matter more than token price?

Because developers pay for completed work, not just per-token rates. If a model finishes the same task with fewer tokens, the total workflow can still cost less even when the per-token price is higher.

Does using more tokens improve coding accuracy?

Not reliably. Research on agentic coding shows accuracy often peaks at intermediate cost, while the most expensive runs can reflect redundant exploration rather than better reasoning.