The old question was, "Which reasoning model should I use?" The new question is sharper: "How much reasoning should this request buy?"
Reasoning effort moved "thinking" from a separate model category into an adjustable inference-time control. Instead of switching between a fast model and a dedicated reasoning model, many modern APIs let developers request more or less reasoning within the same model family, using settings such as low, medium, high, xhigh, or max.
That sounds like a small API design change. It is not. It changes product architecture.
A year ago, teams often had two routes: one cheap chat model for everyday requests and one expensive reasoning model for math, coding, analysis, or agents. Now the pattern is more granular. A single workflow can call the same model at low effort for extraction, high effort for planning, and max effort for irreversible decisions.
Claude Opus 4.6 coverage, for example, describes an /effort control with low, medium, high, and max, plus adaptive thinking for deciding when extended reasoning is needed [5]. Gemini's thinking controls similarly expose a thinkingBudget concept, though research using Gemini notes that the official budget is a guide rather than a hard cap [4]. In practice, this means developers are no longer only routing across models. They are routing across compute modes.
Here is my take: this is a better abstraction. "Reasoning model" was too coarse. "Reasoning effort" is closer to how real software needs AI to behave.
High, xhigh, and max replaced separate reasoning models because they let teams trade accuracy, latency, and cost per request without rebuilding their model stack. The same agent or app can reserve deep reasoning for difficult steps while using cheaper effort settings for predictable work, which is more flexible than a binary reasoning-versus-non-reasoning split.
This is especially useful for product teams. Most AI products are not one prompt. They are pipelines.
A code assistant may summarize files, inspect an error, propose a patch, run tests, and revise. Only some of those steps deserve top-tier reasoning. A support agent may classify intent at low effort, retrieve policy at medium effort, and escalate to high effort when issuing a refund.
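A minimal sketch of that per-step pattern, using the support-agent example above. The step names, the `example-model` identifier, and the `reasoning_effort` payload key are illustrative assumptions, not any provider's actual API; check your provider's documentation for the real parameter name and allowed values.

```python
from dataclasses import dataclass

# Effort labels as used in this article; real providers may name them differently.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")


@dataclass(frozen=True)
class Step:
    name: str
    effort: str

    def __post_init__(self):
        # Fail early on typos such as "hihg" instead of sending them to an API.
        if self.effort not in EFFORT_LEVELS:
            raise ValueError(f"unknown effort level: {self.effort!r}")


# Hypothetical support-agent pipeline mirroring the example in the text.
SUPPORT_PIPELINE = [
    Step("classify_intent", "low"),
    Step("retrieve_policy", "medium"),
    Step("issue_refund", "high"),
]


def build_payloads(pipeline, model="example-model"):
    """Build one request payload per step. The payload keys are placeholders."""
    return [
        {"model": model, "step": step.name, "reasoning_effort": step.effort}
        for step in pipeline
    ]


payloads = build_payloads(SUPPORT_PIPELINE)
for payload in payloads:
    print(payload["step"], "->", payload["reasoning_effort"])
```

The point of the sketch is that the model stays constant while the effort changes per step, so the routing decision lives in ordinary application code.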
The ARES paper makes this precise. It argues that fixed effort strategies are wasteful: always-low damages performance, while always-high burns tokens. A lightweight router that predicts low, medium, or high effort per agent step reduced reasoning token usage by up to 52.7% while preserving task success in several agent benchmarks [3].
That is the "replacement" story. Reasoning did not disappear into one supermodel. It became a knob that orchestration layers can tune.
| Old workflow | New workflow |
|---|---|
| Choose fast model or reasoning model | Choose model plus effort level |
| Route whole task once | Route effort per step |
| Pay high cost for entire run | Pay high cost only when needed |
| Optimize prompt only | Optimize prompt, context, and compute budget |
| Evaluate model accuracy | Evaluate accuracy per dollar and latency |
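
The last row of the table can be made concrete with a small evaluation helper. This is a sketch under stated assumptions: the `Run` record and the sample numbers are made up for illustration, and "cost per successful answer" is one reasonable way to express accuracy per dollar, not a standard metric from the cited research.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    correct: bool      # did the answer pass your grader?
    cost_usd: float    # billed cost of the call
    latency_s: float   # wall-clock time of the call


def summarize(runs):
    """Summarize a batch of runs collected at one effort level."""
    successes = sum(1 for r in runs if r.correct)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "accuracy": successes / len(runs),
        # Infinite cost per success if nothing passed, so bad settings sort last.
        "cost_per_success_usd": total_cost / successes if successes else float("inf"),
        "mean_latency_s": mean(r.latency_s for r in runs),
    }


# Made-up sample data: low effort is cheap but misses one case in four.
low_runs = [Run(True, 0.01, 1.0), Run(False, 0.01, 1.0),
            Run(True, 0.01, 1.0), Run(True, 0.01, 1.0)]
high_runs = [Run(True, 0.05, 6.0) for _ in range(4)]

low_summary = summarize(low_runs)
high_summary = summarize(high_runs)
print(low_summary)
print(high_summary)
```

Comparing the two summaries side by side shows whether the accuracy gain from higher effort is worth the extra cost and latency for your task class.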
If you write prompts manually, this also affects prompt wording. A vague "think hard" instruction is weaker than a clear task specification plus an explicit effort policy. Tools like Rephrase can help rewrite rough instructions into prompts that make the task difficulty and success criteria obvious before you spend extra reasoning tokens.
Max reasoning effort is best reserved for tasks where errors are expensive, context is complex, or the model must coordinate many constraints. Research does not support using max by default. Higher effort can improve context learning, but it can also plateau, waste tokens, or even degrade performance when the task rewards direct pattern recognition.
CL-bench is a useful reality check. The benchmark tested frontier models on tasks requiring them to learn new knowledge from long, complex contexts. Higher reasoning effort generally helped, but modestly. GPT-5.1 improved from 21.2% to 23.7% when moving from low/no reasoning to high effort, while GPT-5.2 showed essentially no average gain [2].
That is not a "max everything" result. It is a "use extra effort when the task has real learning or reasoning load" result.
Another paper, "Effort as Ceiling, Not Dial," is even more sobering. It tested GPT-OSS-20B and GPT-OSS-120B across low, medium, and high effort. The authors found that effort settings only modestly changed trace length, and alignment between model token usage and human cognitive cost stayed nearly invariant [1]. Their interpretation is important: the effort parameter may set an upper budget, while the model's actual allocation policy is mostly learned during training.
So when you set max, you are not necessarily forcing brilliant deliberation. You may simply be allowing more internal tokens if the model chooses to use them.
A community report on deep research tasks found a similar warning sign: higher effort reduced accuracy for GPT-5 and Gemini Flash 3 on their benchmark while increasing cost [6]. That is a community benchmark rather than primary research or official documentation, but it matches what many builders notice in production. More thinking can become overthinking.
Developers should choose reasoning effort by matching the setting to task uncertainty, failure cost, and context complexity. Use low for deterministic transformations, medium for multi-step but bounded work, high for ambiguous reasoning or coding, xhigh for difficult agentic tasks, and max only when the cost of a wrong answer exceeds the cost of extra tokens.
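The last clause, "max only when the cost of a wrong answer exceeds the cost of extra tokens," is an expected-cost comparison. A rough sketch: treat the expected cost of a call as its token cost plus its error rate multiplied by the cost of a failure, then pick the effort level with the lowest total. The token costs and error rates below are invented for illustration; in practice they would come from your own evaluation runs.

```python
def expected_cost(token_cost, error_rate, failure_cost):
    """Expected total cost of one call: what you pay plus what errors cost you."""
    return token_cost + error_rate * failure_cost


def cheapest_effort(profiles, failure_cost):
    """Pick the effort level with the lowest expected total cost.

    `profiles` maps effort label -> (token_cost_usd, error_rate).
    """
    return min(
        profiles,
        key=lambda effort: expected_cost(*profiles[effort], failure_cost),
    )


# Hypothetical measurements: (token cost in USD, observed error rate).
PROFILES = {
    "low": (0.01, 0.10),
    "high": (0.10, 0.05),
    "max": (0.50, 0.04),
}

for failure_cost in (1, 10, 100):
    print(failure_cost, "->", cheapest_effort(PROFILES, failure_cost))
```

With these made-up numbers, low effort wins when a mistake costs about a dollar, high wins at ten dollars, and max only pays off when a mistake costs around a hundred.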
Here is a practical policy I like.
Start with the cheapest effort that can plausibly solve the task. Measure failure modes. Escalate only when there is evidence: long context, conflicting constraints, tool sequencing, multi-file code changes, legal/financial analysis, or failed self-checks.
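That policy can be sketched as an escalation ladder. The `call_model` and `passes_check` callables are hypothetical stand-ins for your own API wrapper and self-check; the fake model in the usage lines exists only so the example runs without a network call.

```python
LADDER = ("low", "medium", "high", "xhigh", "max")


def solve_with_escalation(task, call_model, passes_check, start="low", ceiling="high"):
    """Try the cheapest effort first and climb one rung per failed check.

    Returns (effort_used, answer, ok). `ok` is False if even the ceiling failed,
    in which case the caller should fall back to a human or another strategy.
    """
    first, last = LADDER.index(start), LADDER.index(ceiling)
    if first > last:
        raise ValueError("start effort must not exceed ceiling")
    answer = None
    for effort in LADDER[first:last + 1]:
        answer = call_model(task, effort)
        if passes_check(answer):
            return effort, answer, True
    return ceiling, answer, False


# Fake model for demonstration: it only "succeeds" at high effort.
calls = []


def fake_model(task, effort):
    calls.append(effort)
    return f"{effort}: {task}"


effort, answer, ok = solve_with_escalation(
    "fix flaky test", fake_model, lambda a: a.startswith("high")
)
print(effort, ok, calls)
```

The `ceiling` argument keeps a cheap task class from ever reaching max, which is where the evidence list above comes in: raise the ceiling only for task classes that have shown they need it.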
For agent systems, avoid one global setting. The ARES results show why: effort should be selected per step, because "open the target URL" and "recover from a bad navigation path" are not the same cognitive task [3].
For single-response prompts, use the prompt itself to signal difficulty. Do not just ask for "deep reasoning." Explain the stakes, constraints, and verification criteria.
Before:
Analyze this incident report and tell me what happened.
After:
Analyze this incident report for root cause. Use high reasoning effort if available. Identify the timeline, distinguish facts from assumptions, list the most likely failure chain, and provide a confidence-rated remediation plan. If evidence is missing, say exactly what logs or metrics would resolve the uncertainty.
For simple work, do the opposite.
Before:
Use max reasoning and rewrite this Slack message.
After:
Rewrite this Slack message to be concise, friendly, and clear. Do not add new information. Keep it under 80 words. Use low reasoning effort if available.
That second prompt saves money because it tells the system the task is bounded. If you want more examples of prompt patterns like this, the Rephrase blog has practical guides for turning vague requests into structured prompts.
Reasoning effort makes prompt engineering more operational. A good prompt no longer just describes the answer; it describes the amount of cognitive work the task deserves. The best prompts separate task instructions, success criteria, evidence requirements, and compute policy so the model can spend reasoning where it matters.
This is the part many teams miss.
If your prompt is messy, max effort often amplifies the mess. The model spends more time wandering through unclear goals. If your prompt is structured, higher effort has something useful to optimize against.
A better prompt for effort-controlled models usually contains four elements in prose: what the task is, what evidence to use, what constraints matter, and what would make the answer unacceptable.
For example:
You are reviewing a pull request that changes billing logic. Use high or xhigh reasoning effort if available because mistakes could affect invoices.
Review only the diff and the billing rules below. Check for calculation errors, missing edge cases, backwards compatibility risks, and test coverage gaps. Return a table with issue, severity, evidence, and recommended fix. If no issue is found, explain the strongest reason the change appears safe.
Notice the phrase "because mistakes could affect invoices." That is not fluff. It gives the system a reason to spend more compute.
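If you assemble prompts in code, the four elements plus the effort rationale can be kept as separate fields. This is a small sketch, not a recommended template: the field labels and the `build_prompt` helper are hypothetical, and the wording is lifted from the billing example above.

```python
def build_prompt(task, evidence, constraints, unacceptable, effort_reason=None):
    """Assemble a prompt from task, evidence, constraints, and failure criteria.

    `effort_reason` states why the task deserves more (or less) reasoning.
    """
    parts = [
        f"Task: {task}",
        f"Evidence to use: {evidence}",
        "Constraints: " + "; ".join(constraints),
        "Unacceptable if: " + "; ".join(unacceptable),
    ]
    if effort_reason:
        parts.append(f"Effort note: {effort_reason}")
    return "\n".join(parts)


prompt = build_prompt(
    task="Review a pull request that changes billing logic.",
    evidence="Only the diff and the billing rules provided.",
    constraints=["Return a table with issue, severity, evidence, and recommended fix"],
    unacceptable=["Calculation errors are missed", "Claims are made without evidence"],
    effort_reason="Use high or xhigh reasoning effort if available because "
                  "mistakes could affect invoices.",
)
print(prompt)
```

Keeping the effort rationale as its own field makes it easy to drop for low-stakes task classes without touching the rest of the prompt.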
For daily writing, I would not do this manually every time. I would use a prompt-improvement layer like Rephrase to turn raw intent into a clean prompt, then set effort based on the task class.
The future of reasoning effort is adaptive routing. Static settings like high, xhigh, and max are transitional controls. The better system is one that estimates task difficulty, picks the minimum sufficient effort, observes uncertainty or failure, and escalates only when the next step truly needs deeper reasoning.
That is where the field is clearly moving.
Research on matched thinking-token budgets shows that compute accounting matters. In one study, single-agent systems often matched or beat multi-agent systems when thinking budgets were controlled, and the authors warned that API-reported thinking counts can be opaque or misleading [4]. That matters because "more agents" and "more effort" can both look good simply because they spend more compute.
The next layer is budget-aware orchestration. Your app should know when to buy more reasoning and when to stop.
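One way to sketch budget-aware orchestration is a tracker that caps the requested effort at whatever the remaining budget can afford. The per-effort token estimates are assumptions invented for this example, and because reported thinking counts can be opaque, the tracker records actual usage after each call rather than trusting the estimate.

```python
LADDER = ("low", "medium", "high", "xhigh", "max")

# Assumed rough reasoning-token estimates per call; measure your own workload.
EST_TOKENS = {"low": 500, "medium": 2000, "high": 8000, "xhigh": 16000, "max": 32000}


class ReasoningBudget:
    """Tracks a reasoning-token budget for one task or session."""

    def __init__(self, total_tokens):
        self.remaining = total_tokens

    def affordable(self, requested):
        """Return the highest effort at or below `requested` that fits the budget.

        Returns None when even the lowest setting no longer fits.
        """
        candidates = LADDER[: LADDER.index(requested) + 1]
        for effort in reversed(candidates):
            if EST_TOKENS[effort] <= self.remaining:
                return effort
        return None

    def record(self, tokens_used):
        """Subtract what the provider actually reported, not the estimate."""
        self.remaining = max(0, self.remaining - tokens_used)


budget = ReasoningBudget(total_tokens=10_000)
first = budget.affordable("max")    # capped by the remaining budget
budget.record(8_000)
second = budget.affordable("high")  # budget is now too small for "high"
print(first, second, budget.remaining)
```

A `None` result is the signal to stop buying reasoning: return the best answer so far, or hand off to a human.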
My rule of thumb is simple: if the task has clear inputs, clear output format, and low cost of failure, keep effort low. If the task has long context, hidden dependencies, tool use, or high cost of failure, move up. If the model has already failed once and the second attempt needs diagnosis, that is when xhigh or max earns its keep.
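That rule of thumb translates almost directly into a function. The specific labels returned here are my reading of "keep effort low" and "move up"; the text does not pin them to exact settings, so treat the mapping as an assumption to tune against your own evaluations.

```python
def rule_of_thumb(
    clear_inputs,
    clear_output_format,
    high_failure_cost,
    long_context=False,
    hidden_dependencies=False,
    tool_use=False,
    failed_before=False,
):
    """Map the rule of thumb onto an effort label (the mapping is an assumption)."""
    if failed_before:
        # A retry that needs diagnosis is where the top tiers earn their keep.
        return "xhigh"
    if long_context or hidden_dependencies or tool_use or high_failure_cost:
        return "high"
    if clear_inputs and clear_output_format:
        return "low"
    # Unclear inputs or format, but no strong risk signal: take the middle setting.
    return "medium"


print(rule_of_thumb(True, True, False))                      # bounded rewrite
print(rule_of_thumb(True, True, False, tool_use=True))       # agent step
print(rule_of_thumb(True, True, False, failed_before=True))  # diagnostic retry
```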
The knob is powerful. But the discipline is knowing when not to turn it.
Documentation & Research
Community Examples
Reasoning effort is an API or model setting that controls how much internal computation a model may spend before answering. Higher settings usually cost more and add latency, but they can improve hard reasoning, coding, and agent tasks.
Labels such as high, xhigh, and max vary by provider, but they generally represent increasing reasoning budgets. High is a strong default for hard tasks, xhigh adds a finer tier before the top setting, and max asks the model to use its deepest available reasoning mode.