Most AI teams still talk about token pricing like it's the whole story. It isn't. In May 2026, the real question is simpler: how much did it cost to get the job done?
Cost per completed task matters more because businesses buy outcomes, not tokens. A model with a low per-token price can still be expensive if it fails often, thinks too long, or needs repeated attempts. The metric that matters is what you spend to reach an acceptable result, consistently, on your actual workflow.[1][2]
Here's the catch. Token pricing is a unit price. It tells you the cost of ingredients, not the cost of the meal. If one model burns through huge hidden reasoning traces, loops through the same files, or fails and needs a rerun, the "cheap" model stops being cheap very fast.
A recent pricing study found a pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with the lower listed price actually had the higher real cost.[1] That's not a rounding error. That's a broken buying heuristic.
For teams shipping products, the better formula is brutally practical:
Cost per completed task = Total spend / Number of successfully completed tasks
If your acceptance bar includes correctness, latency, or human review, include those too. Otherwise you are optimizing for the wrong thing.
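As a minimal sketch of that formula (the function names and the review-cost inputs are mine, not from either paper), the plain and stricter versions look like this:

```python
def cost_per_completed_task(total_spend: float, completed: int) -> float:
    """Dollars per accepted result; infinite if nothing passed."""
    return total_spend / completed if completed else float("inf")


def strict_cost_per_completed_task(
    model_spend: float,
    review_hours: float,  # human review time, if your acceptance bar includes it
    hourly_rate: float,
    completed: int,
) -> float:
    """Stricter variant: fold human review into total cost before dividing."""
    total = model_spend + review_hours * hourly_rate
    return total / completed if completed else float("inf")
```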
Cost per token is weak in 2026 because token counts vary wildly across models, tasks, and repeated runs of the same task. The same listed price can produce very different real costs once you factor in reasoning tokens, caching behavior, retries, and agent loops.[1][2]
The research is getting pretty clear on this point.
One paper on reasoning models shows that thinking tokens are now a major driver of real cost. On the same query, one model may use 900% more thinking tokens than another.[1] That is exactly how a model with "better pricing" ends up costing more.
Another paper on coding agents found something equally important: in long-running agent workflows, input tokens dominate overall cost, not output tokens, even with caching enabled.[2] Agentic coding tasks consumed around 1000x more tokens than simpler code chat or reasoning tasks, and runs on the same task could differ by up to 30x in total token usage.[2]
That means cost per token misses three practical realities. First, token volume is unstable. Second, token type matters. Third, success is not guaranteed.
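To see how a reversal happens mechanically, here's an illustrative sketch. Every price and token count below is invented for the example (none are figures from the cited papers); the one assumption carried over is that hidden thinking tokens bill at the output rate, which is common but worth confirming with your provider.

```python
def real_cost_usd(in_tok: int, out_tok: int, think_tok: int,
                  in_price: float, out_price: float) -> float:
    """Prices are per million tokens; thinking tokens billed at the output rate."""
    return (in_tok * in_price + (out_tok + think_tok) * out_price) / 1_000_000


# Model A lists lower prices but emits 10x the hidden reasoning on the same query.
a = real_cost_usd(in_tok=3_000, out_tok=800, think_tok=9_000,
                  in_price=1.00, out_price=4.00)
# Model B lists higher prices but answers with little hidden reasoning.
b = real_cost_usd(in_tok=3_000, out_tok=800, think_tok=900,
                  in_price=1.50, out_price=6.00)

print(f"Model A (cheaper on paper): ${a:.4f} per query")  # $0.0422
print(f"Model B (pricier on paper): ${b:.4f} per query")  # $0.0147
```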
I've noticed that teams often compare model pricing pages side by side and assume they've done cost analysis. They haven't. They've done tariff comparison, not workload economics.
Teams should measure AI cost at the workload level by combining spend, success rate, and quality thresholds into one number. The cleanest version is cost per completed task, supported by secondary metrics like pass rate, latency, and retry rate.[1][2]
If I were setting this up today, I'd use a scorecard like this:
| Metric | What it tells you | Why it matters |
|---|---|---|
| Cost per completed task | Dollars per accepted result | Primary business metric |
| Task success rate | Fraction of tasks that pass | Shows reliability |
| Retry rate | How often reruns are needed | Captures hidden waste |
| Median latency | Time to usable result | Affects UX and ops |
| Token cost breakdown | Input, output, thinking, cache | Useful for diagnosis, not final ranking |
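Assuming you log one record per attempt, a minimal sketch of that scorecard could look like this (the `Run` schema and its field names are illustrative, not a standard from any particular tool):

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    task_id: str      # one task may have several attempts
    cost_usd: float
    latency_s: float
    passed: bool      # did this attempt clear your acceptance bar?


def scorecard(runs: list[Run]) -> dict[str, float]:
    total_spend = sum(r.cost_usd for r in runs)
    tasks = {r.task_id for r in runs}
    completed = {r.task_id for r in runs if r.passed}
    retries = len(runs) - len(tasks)  # attempts beyond the first, per task
    return {
        "cost_per_completed_task": total_spend / len(completed) if completed else float("inf"),
        "task_success_rate": len(completed) / len(tasks),
        "retry_rate": retries / len(tasks),
        "median_latency_s": median(r.latency_s for r in runs),
    }
```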
This is also where prompt quality matters more than people think. Better prompts can reduce retries, constrain wandering, and raise first-pass success. Tools like Rephrase help by turning rough instructions into tighter prompts that fit the task and tool automatically, which is often the fastest way to improve cost per completed task without changing models.
If you want more articles on prompt optimization and AI workflows, the Rephrase blog is worth bookmarking.
More tokens do not reliably translate into more completed tasks. In several recent studies, performance improved only up to a point, then flattened or even worsened as token usage rose. Past that point, extra spend often reflects redundancy, not progress.[1][2]
This point is easy to miss because "more reasoning" sounds like "more intelligence." But that's not always what happens in production.
In agentic coding research, accuracy often peaked at intermediate cost and then saturated or declined at higher cost levels.[2] The more expensive runs were frequently associated with repeated file views and repeated edits, which signals thrashing rather than effective problem solving.[2]
That matches what I've seen in the wild. Expensive failures don't usually look like brilliant deep thought. They look like the model getting lost politely.
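If you log agent actions, a toy heuristic can surface this pattern in triage. To be clear, this is a crude proxy of my own, not the detection method from the paper:

```python
from collections import Counter


def looks_like_thrashing(actions: list[tuple[str, str]], threshold: int = 5) -> bool:
    """actions: (verb, target) pairs from an agent trace, e.g. ("view", "src/app.py").

    Flags runs that repeat the exact same action on the same file many times,
    a rough stand-in for the repeated-view/repeated-edit signal.
    """
    return any(n >= threshold for n in Counter(actions).values())
```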
Here's a simple before-and-after framing I'd use internally:
| Before | After |
|---|---|
| "Model A is cheaper because input/output tokens cost less." | "Model B is cheaper for our workflow because it completes more tasks with fewer retries." |
| "This run used more tokens, so it probably reasoned harder." | "This run used more tokens, but it may have looped or overthought." |
| "We need lower token prices." | "We need higher first-pass completion at acceptable quality." |
That shift in language changes decision-making fast.
To compare models fairly, run the same task set, apply the same quality bar, and divide total spend by passed tasks. This exposes pricing reversals, retry overhead, and workflow-specific inefficiencies that per-token pricing hides.[1][2]
A lightweight evaluation process works well:

1. Pick a fixed set of tasks that represent your real workload.
2. Define the acceptance bar up front: correctness, latency, review effort.
3. Run every candidate model on the same set, with production-like prompts, and log spend per attempt.
4. Divide each model's total spend, retries included, by its count of passed tasks.
For example, if Model X costs $120 across 200 runs and completes 80 tasks, its cost per completed task is $1.50. If Model Y costs $150 but completes 130 tasks, its cost per completed task is about $1.15. Model Y is the better economic choice, even though its raw spend is higher.
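The same numbers, as a sortable comparison you can extend to more candidates:

```python
# (total spend, completed tasks) from the example above
models = {"Model X": (120.0, 80), "Model Y": (150.0, 130)}

for name, (spend, done) in sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name}: ${spend / done:.2f} per completed task")
# Model Y: $1.15 per completed task
# Model X: $1.50 per completed task
```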
This is also why prompt engineering deserves a seat in cost discussions. A stronger prompt can improve task completion enough to beat a cheaper model with a weaker prompt. In practice, that's often the highest-ROI optimization available. If you frequently switch between ChatGPT, Claude, coding tools, Slack, and design apps, Rephrase is useful because it rewrites the same rough instruction for the right context in a couple of seconds.
Cost per completed task should be the headline metric, but not the only one. You still need latency, variance, and failure analysis to understand why a model is cheap or expensive on your workload.[1][2]
Here's what I'd add around it. Track variance because repeated runs can swing heavily, especially on reasoning-heavy or agentic tasks.[1][2] Track failure modes because some models burn money by hesitating while others fail fast. And track human correction time if humans are in the loop, since cheap low-quality outputs can quietly shift cost downstream.
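For the variance piece, the coefficient of variation across repeated runs of the same task is a quick check. The costs below are invented to show the shape of the problem:

```python
from statistics import mean, stdev

# Per-run cost for five repeats of the same task -- note the one runaway attempt.
run_costs = [0.42, 0.45, 1.80, 0.40, 0.47]

cv = stdev(run_costs) / mean(run_costs)  # coefficient of variation
print(f"mean ${mean(run_costs):.2f} per run, CV {cv:.0%}")  # high CV = unstable spend
```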
What's interesting is that both major papers point to the same conclusion from different angles. One shows listed prices can mislead because hidden reasoning tokens distort real cost.[1] The other shows agentic tasks create unstable, input-heavy spending that doesn't map cleanly to success.[2] Put them together, and the case is pretty strong: token price is an accounting input, not a decision metric.
If you remember one thing, make it this: don't buy tokens, buy completed work. That means benchmarking models on the tasks you actually care about, with prompts and workflows close to production, and ranking them by cost per accepted outcome. Everything else is just a proxy.
**What is cost per completed task?**

Cost per completed task measures how much you spend to get a usable outcome, not just how many tokens a model consumed. It folds in success rate, retries, and wasted runs.

**How do you calculate cost per completed task?**

Take total spend for a workload and divide it by the number of successfully completed tasks. If you want a stricter number, include retries, human review, and failed runs in the total cost.