Why Cost Per Task Beats Cost Per Token

Discover why cost per completed task is a better AI metric than cost per token in 2026, with research-backed examples and practical guidance.

Ilia Ilinskii
Rephrase · May 23, 2026

Prompt engineering · 8 min read

Most AI teams still shop models like they're buying electricity: compare unit price, pick the cheapest, and assume the bill will work out. In May 2026, that mindset is wrong often enough to hurt.

Key Takeaways

Cost per completed task is usually a better decision metric than cost per token because it includes whether the job actually gets done.
Recent research shows cheaper listed models can end up costing more in real workloads because token consumption varies wildly across models and tasks [1].
In agent workflows, more tokens do not reliably mean better outcomes; expensive runs often reflect redundancy, not deeper intelligence [2].
If you manage AI in production, you should track actual task success, actual bill, and failure rate together, not just API pricing pages.
For prompt and workflow tuning, tools like Rephrase help reduce waste upstream by making requests clearer before they hit the model.

What does cost per completed task actually mean?

Cost per completed task is the average amount you spend to get a real success, not just to generate tokens. It rolls price, token usage, retries, and success rate into one number, which is why it maps better to product reality than raw token pricing.

Here's the plain version I use:

Cost per completed task = total spend / number of successfully completed tasks

That sounds obvious. The catch is that most teams don't optimize for it. They optimize for cost per 1M input tokens, or cost per request, or a provider's advertised "cheap" tier. Those are procurement metrics. They are not outcome metrics.

If model A costs half as much per token as model B, but needs more retries, uses more hidden reasoning tokens, or fails more often, model A can easily be more expensive per useful result. That is not theory anymore. It's showing up clearly in 2026 research [1][2].
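To make the model A versus model B comparison concrete, here is a minimal sketch. Every number in it is invented for illustration, and it assumes attempts are independent and that you retry until one succeeds, so the expected number of attempts is 1 / success rate.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-model numbers; none of these come from a real price list."""
    name: str
    price_per_million_tokens: float  # USD, blended rate (assumed)
    tokens_per_attempt: int          # average billed tokens per attempt, reasoning included
    success_rate: float              # share of attempts that complete the task


def cost_per_attempt(m: ModelProfile) -> float:
    return m.tokens_per_attempt / 1_000_000 * m.price_per_million_tokens


def cost_per_completed_task(m: ModelProfile) -> float:
    # With independent attempts and retry-until-success, the expected number
    # of attempts is 1 / success_rate, and failed attempts are billed too.
    return cost_per_attempt(m) / m.success_rate


# Model A is half the price per token, but burns more tokens and fails more often.
model_a = ModelProfile("A", price_per_million_tokens=5.0,
                       tokens_per_attempt=60_000, success_rate=0.5)
model_b = ModelProfile("B", price_per_million_tokens=10.0,
                       tokens_per_attempt=20_000, success_rate=0.8)

for m in (model_a, model_b):
    print(f"Model {m.name}: ${cost_per_completed_task(m):.2f} per completed task")
```

With these made-up inputs, the model that is half the price per token comes out more than twice as expensive per completed task. Real attempts are rarely independent, so treat the 1 / success rate shortcut as a first approximation and prefer measured spend where you have it.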

Why is cost per token the wrong default metric?

Cost per token is incomplete because it ignores token efficiency, success probability, and the shape of real workloads. It tells you the price of computation units, but not how many units a model will burn or whether those units create business value.

The clearest evidence comes from the "price reversal" paper. Across eight frontier reasoning models and nine task sets, researchers found that in 21.8% of model-pair comparisons, the model with the lower listed API price actually had the higher real cost [1]. In the worst cases, the reversal reached 28x [1].

That is a huge deal. It means the sticker price on the model page is not a reliable proxy for what you will actually spend.

Why does this happen? Mostly because models vary dramatically in thinking token usage. In one example from the paper, Gemini 3 Flash looked much cheaper on paper than GPT-5.2, but consumed far more thinking tokens on the same problem and ended up costing more overall [1].
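The mechanism is easy to reproduce with arithmetic. The sketch below uses two hypothetical models with invented prices and token counts (not the figures from the paper), and it assumes thinking tokens are billed at the output rate, which you should check against your provider's billing rules.

```python
def request_cost(input_tokens, thinking_tokens, output_tokens,
                 input_price, output_price):
    """USD cost of one request. Prices are per 1M tokens.

    Assumption: thinking tokens are billed at the output-token rate.
    """
    billed_output = thinking_tokens + output_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000


# Hypothetical "cheap" model: low list price, very long thinking trace.
cheap_listed = request_cost(input_tokens=2_000, thinking_tokens=40_000,
                            output_tokens=500, input_price=0.5, output_price=3.0)

# Hypothetical "pricey" model: 4x the list price, short thinking trace.
pricey_listed = request_cost(input_tokens=2_000, thinking_tokens=3_000,
                             output_tokens=500, input_price=2.0, output_price=12.0)

print(f"cheap list price:  ${cheap_listed:.4f} per request")
print(f"pricey list price: ${pricey_listed:.4f} per request")
```

In this invented case the model with the lower list price costs more per request, because the thinking tokens dominate the bill. The only way to know which side of the reversal you are on is to measure token usage on your own tasks.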

So when someone says, "This model is cheaper," my first question now is: cheaper at what? Per token? Per request? Or per solved task?

Why do agent workflows make this problem worse?

Agent workflows amplify the gap between token price and real cost because they repeatedly read context, call tools, and loop through long trajectories. In these systems, input accumulation and retries can dominate billing even when output tokens look modest.
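A rough way to see the input accumulation effect is to simulate it. This sketch assumes the simplest possible agent loop, with no prompt caching or context trimming: every step resends the full history, then appends its own output and the tool result. The function name and the token counts are illustrative only.

```python
def agent_token_totals(system_tokens, step_output_tokens, tool_result_tokens):
    """Return (total_input, total_output) tokens billed over one agent run.

    Assumes each step re-reads the whole context so far (no caching, no trimming).
    """
    context = system_tokens
    total_input = 0
    total_output = 0
    for out_tokens, tool_tokens in zip(step_output_tokens, tool_result_tokens):
        total_input += context               # the full history is read again
        total_output += out_tokens
        context += out_tokens + tool_tokens  # history grows for the next step
    return total_input, total_output


steps = 20
total_in, total_out = agent_token_totals(
    system_tokens=2_000,
    step_output_tokens=[200] * steps,   # short model replies
    tool_result_tokens=[1_500] * steps, # file contents, test output, etc.
)
print(total_in, total_out)
```

Twenty short steps produce only 4,000 output tokens here, but the run bills 363,000 input tokens, because input grows roughly with the square of the step count. Caching and context pruning change the constants, not the basic shape.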

A 2026 study on agentic coding tasks found that agent workflows consume vastly more tokens than ordinary chat or one-shot reasoning. On average, agentic coding used 1000x more tokens than code reasoning and was driven primarily by input tokens rather than output tokens [2].

That same paper found three details that matter a lot for budgeting:

First, token usage is highly variable. The same task can cost radically different amounts across runs, and some repeated runs differed by up to 30x in total tokens [2].

Second, more tokens did not lead to better success. Accuracy often peaked at intermediate cost and then flattened or degraded at higher spend levels [2].

Third, expensive runs often came from repeated file reads, repeated edits, and other redundant behavior, not from smarter reasoning [2].

That's why cost per completed task is the better lens. It captures the ugly truth: a long, expensive agent trace that fails is not "more work." It's just more waste.
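The third finding suggests a cheap diagnostic you can run on your own traces: count tool calls that repeat with identical arguments. The trace format below, a list of (tool name, argument) pairs, is a hypothetical one, so adapt it to whatever your agent framework logs. It is also only a heuristic, since re-reading a file after editing it can be perfectly legitimate.

```python
from collections import Counter


def redundant_calls(trace):
    """Map each repeated (tool, argument) call to its number of extra repeats.

    `trace` is a list of (tool_name, argument) tuples (hypothetical log format).
    """
    counts = Counter(trace)
    return {call: n - 1 for call, n in counts.items() if n > 1}


trace = [
    ("read_file", "auth/session.py"),
    ("read_file", "auth/tokens.py"),
    ("read_file", "auth/session.py"),
    ("run_tests", "tests/auth"),
    ("read_file", "auth/session.py"),
    ("run_tests", "tests/auth"),
]

repeats = redundant_calls(trace)
for (tool, arg), extra in repeats.items():
    print(f"{tool}({arg}) repeated {extra} extra time(s)")
```

If the most expensive runs in your logs also show the most repeats, that points at scaffold or prompt design rather than at the model's price.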

How should teams compare models in 2026?

Teams should compare models on actual workload outcomes, using the same tasks, same success criteria, and the same billing rules. The goal is not to find the cheapest model on paper. It is to find the model with the best reliability-adjusted cost.

Here's the table I'd use internally:

Metric	What it tells you	Why it matters
Cost per token	Unit price of inference	Good for rough budgeting, weak for model choice
Cost per request	Average price of one call	Useful for simple apps, weak for agents
Success rate	Share of tasks truly completed	Critical, but incomplete without cost
Cost per completed task	Spend required for one successful outcome	Best single metric for product decisions
Cost of failure	Spend lost on unsuccessful runs	Essential for agent and tool workflows

What I noticed is that most teams already have the logs needed for this. They just don't roll them up the right way. If you can measure request cost and task success, you can measure the metric that matters.
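As a sketch of that rollup, assume each logged run records a task ID, the billed cost, the number of API requests, and whether the run passed your success check. The `RunLog` fields and the sample rows are hypothetical, so map them onto whatever your logging already captures.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """One logged run; the field names are assumptions, not a standard schema."""
    task_id: str
    cost_usd: float   # billed cost of this run
    requests: int     # API calls made during the run
    succeeded: bool   # did the run meet the task's success criteria?


def rollup(runs):
    """Compute the table's metrics from raw run logs. Retries share a task_id."""
    if not runs:
        raise ValueError("no runs to roll up")
    total_spend = sum(r.cost_usd for r in runs)
    total_requests = sum(r.requests for r in runs)
    tasks = {r.task_id for r in runs}
    completed = {r.task_id for r in runs if r.succeeded}
    failed_spend = sum(r.cost_usd for r in runs if not r.succeeded)
    return {
        "cost_per_request": total_spend / total_requests,
        "success_rate": len(completed) / len(tasks),
        # Spend on failed runs stays in the numerator on purpose.
        "cost_per_completed_task": (
            total_spend / len(completed) if completed else float("inf")
        ),
        "cost_of_failure": failed_spend,
    }


runs = [
    RunLog("t1", 0.40, 8, True),
    RunLog("t2", 0.90, 20, False),  # first attempt failed
    RunLog("t2", 0.60, 12, True),   # retry succeeded
    RunLog("t3", 1.10, 25, False),
]
metrics = rollup(runs)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy log, two of three tasks complete for $3.00 in total, so the cost per completed task is $1.50 even though the cheapest successful run cost $0.40. Running the same rollup per model setup gives you the comparison this section describes.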

A related paper on agent benchmarking makes the same broader point from another angle: evaluation itself is expensive, and per-task costs vary wildly, with SWE-bench Verified runs ranging from $0.08 to $32 per task depending on model and scaffold [3]. That spread alone should kill the idea that "token price" is enough.

How do you improve cost per completed task in practice?

You improve cost per completed task by reducing wasted context, tightening prompts, choosing models by workload, and measuring failures aggressively. The best gains usually come from workflow design, not just model switching.

Here's a simple before-and-after prompt example:

Before:

Fix this bug in my repo.

After:

You are debugging a Python repository. 
Goal: identify the root cause of the failing authentication flow and propose the smallest safe fix.
Constraints: do not refactor unrelated modules, list assumptions, and stop after proposing one primary fix plus one fallback.
Success criteria: tests for auth pass, no changes outside auth-related files unless required.

The second prompt does three useful things. It narrows scope. It defines success. And it reduces wandering. That matters because wandering is expensive.

This is also where prompt tooling earns its keep. If your team writes prompts all day in Slack, your IDE, or docs, Rephrase can turn rough instructions into tighter, skill-specific prompts in a couple of seconds. It won't solve model economics by itself, but it can cut a surprising amount of ambiguity before the tokens start flowing.

For deeper reading on practical prompting workflows, I'd also point people to the Rephrase blog, because the fastest way to lower cost per completed task is often to ask better in the first place.

What metric should you report to leadership?

You should report cost per completed task alongside success rate and total spend, because leaders care about delivered outcomes, not token trivia. This framing makes AI costs legible to product, finance, and operations teams at the same time.

If you tell a VP, "Model X is $3 cheaper per million tokens," that sounds technical but not meaningful. If you tell them, "Model X delivers one successful ticket resolution for $0.84 versus $1.37 on Model Y," that is instantly clear.

There's also a strategic benefit. Once you report outcome-based metrics, prompt quality, tool design, retry policy, and model selection all become part of the same optimization loop. That's the right loop.

The Reddit chatter around newer usage-based pricing shifts is noisy, but one thing it gets right is this: the flat-rate fantasy is ending, and heavy agent use is being repriced closer to real consumption. Teams without outcome-level visibility are going to get surprised [5].

The short version is simple. Cost per token is a component. Cost per completed task is the business metric. In 2026, if you're still choosing models by sticker price alone, you're probably optimizing the wrong line on the spreadsheet.

If you want a practical next step, take one production workflow, run the same 50 tasks across two or three model setups, and compute the real number: dollars spent per successful completion. Once you see that number, it's hard to go back.

References

Documentation & Research

1. The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More - arXiv cs.CL
2. How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks - arXiv cs.CL
3. Efficient Benchmarking of AI Agents - arXiv cs.AI
4. $OneMillion-Bench: How Far are Language Agents from Human Experts? - arXiv cs.LG

Community Examples

5. Copilot just 9x'd Sonnet and 27x'd Opus and teams have no idea - r/ChatGPT

Frequently asked

What is cost per completed task in AI?

Cost per completed task measures how much you spend, on average, to get a task successfully finished. It combines both price and success rate, which makes it more useful than looking at token pricing alone.

How do I calculate cost per completed task?

A practical formula is total spend divided by successful completions. If you want to compare models fairly, run the same task set, track actual billed cost, and divide by the number of tasks that truly pass.