Long-running agents have a bad habit: they burn budget on the wrong parts of the job. The model thinks hard where it shouldn't, then arrives underpowered when the task actually gets difficult.
Claude Opus 4.7 gives developers two new knobs for controlling reasoning spend: xhigh effort for finer thinking control, and task budgets for guiding token allocation over longer agent runs. Anthropic positions them as production levers for managing the tradeoff between reasoning quality, latency, and cost in hard, multi-step workflows.[1]
That matters because agent cost rarely explodes in one giant call. It leaks through dozens or hundreds of slightly-too-expensive calls. If you run coding agents, research agents, or support agents overnight, this is where margins disappear.
The interesting bit is that xhigh sits between high and max. That sounds minor, but it fixes a real operational problem. "High" can be too shallow for planning-heavy steps. "Max" can be overkill. xhigh gives you a middle lane.
Task budgets solve a different issue. They let you guide overall spend across longer runs instead of treating every step like an isolated reasoning event.[1] In practice, that means your agent can be taught not to blow premium thinking on low-value subtasks.
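To make "guiding overall spend" concrete, here is a minimal sketch of a per-task token budget tracker. The class and method names are illustrative, not the Anthropic API; the point is that a long run carries one cumulative budget that every step draws from.

```python
# Sketch of a per-task token budget tracker (names are illustrative,
# not the Anthropic API). It caps total reasoning spend across a run
# and lets each step check affordability before it fires.

class TaskBudget:
    def __init__(self, total_tokens: int):
        self.total = total_tokens
        self.spent = 0

    def record(self, tokens_used: int) -> None:
        """Record the token cost of one completed step."""
        self.spent += tokens_used

    @property
    def remaining(self) -> int:
        return max(self.total - self.spent, 0)

    def can_afford(self, estimated_tokens: int) -> bool:
        """Check whether a step fits in the remaining budget."""
        return estimated_tokens <= self.remaining


budget = TaskBudget(total_tokens=200_000)
budget.record(12_000)   # an expensive planning step
budget.record(3_000)    # a cheap tool call
print(budget.remaining)           # 185000
print(budget.can_afford(50_000))  # True
```

With a tracker like this in the loop, "don't blow premium thinking on low-value subtasks" becomes a check the orchestrator can actually enforce rather than a hope.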
Reasoning spend gets out of control when agents apply expensive deliberation uniformly across a trajectory, even though long-running tasks contain a mix of cheap steps, hard steps, and dead ends. Research on long-horizon agent benchmarks shows that planning quality, not raw tool access, is often the real bottleneck.[2][3]
Here's what I noticed reading the benchmarks: the expensive part is not always "generation." It's the churn. Agents reread, over-plan, re-check, and keep burning tokens while making weak strategic choices.
In EnterpriseOps-Gym, performance drops as task horizon increases, and more thinking budget helps in many domains but not all. Some workflows improve sharply with higher thinking budgets, while others plateau early or even peak at medium effort.[2] That is the key lesson. More reasoning is not a universal fix.
In YC-Bench, long-term success depends heavily on maintaining useful memory and avoiding repeated bad decisions. The winning models don't just "think more." They preserve strategy, blacklist bad patterns, and stay coherent over time.[3]
So if your agent keeps getting expensive, the issue may not be that it needs more compute. It may be that it lacks budget discipline.
Use lower effort for routine steps, high or xhigh for decision-heavy steps, and reserve the most expensive reasoning only for moments when better planning changes the outcome. Anthropic recommends starting with high or xhigh for coding and agentic use cases, not blindly pushing every call to maximum.[1]
I'd treat effort like a routing policy, not a model preference. Cheap steps stay cheap. Hard steps earn extra thinking. Unknown steps start moderate, then escalate only if they fail.
Here's a simple comparison:
| Step type | Recommended effort | Why |
|---|---|---|
| Retrieval, classification, extraction | low / medium | Little value from extended reasoning |
| Simple tool call formatting | low | Mostly deterministic |
| Multi-step planning | high / xhigh | Better reasoning changes downstream success |
| Debugging ambiguous failures | xhigh | Worth spending for diagnosis |
| Final verification on critical tasks | xhigh / max | Expensive, but sometimes justified |
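The routing table above can be sketched as a small policy function. The step-type labels and the escalate-on-failure rule are my assumptions; how the chosen level is actually passed to the model depends on your SDK version, so that part is left out.

```python
# Minimal effort-routing policy mirroring the table above.
# Step-type labels are illustrative; unknown steps start moderate
# and escalate one level per prior failed attempt.

EFFORT_POLICY = {
    "retrieval": "low",
    "classification": "medium",
    "tool_formatting": "low",
    "planning": "xhigh",
    "debugging": "xhigh",
    "final_verification": "max",
}

LEVELS = ["low", "medium", "high", "xhigh", "max"]

def choose_effort(step_type: str, prior_failures: int = 0) -> str:
    """Pick the cheapest effort level for a step, escalating on failure."""
    base = EFFORT_POLICY.get(step_type, "medium")  # unknown steps start moderate
    idx = min(LEVELS.index(base) + prior_failures, len(LEVELS) - 1)
    return LEVELS[idx]

print(choose_effort("retrieval"))                     # low
print(choose_effort("summarize"))                     # medium (unknown step)
print(choose_effort("summarize", prior_failures=2))   # xhigh
```

The design choice worth copying is the escalation path: steps earn expensive reasoning by failing cheaply first, rather than getting it by default.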
This routing approach matches what the research suggests. In EnterpriseOps-Gym, giving weaker executors better plans produced bigger gains than simply throwing more complexity at the agent architecture.[2]
The best way to control reasoning spend is to make budget policy explicit in the prompt: tell the agent when to stay cheap, when to escalate effort, and what conditions justify deeper thinking. If you don't specify this, the agent has to invent its own spend policy.
Here's a weak prompt:
> Solve this task thoroughly and use deep reasoning when needed.
That sounds fine. It's also vague enough to invite overthinking.
Here's a better version:
> You are a cost-aware long-running agent.
>
> Budget policy:
> - Use minimal reasoning for retrieval, extraction, formatting, and straightforward tool calls.
> - Use high effort for planning, ambiguity resolution, and dependency ordering.
> - Escalate to xhigh only when a mistake would create expensive downstream rework or when prior attempts failed.
> - Before escalating, briefly state why the higher effort is justified.
> - Prefer concise intermediate reasoning and avoid re-evaluating settled decisions unless new evidence appears.
>
> Execution policy:
> 1. Classify the current step as routine, planning-heavy, or failure-recovery.
> 2. Choose the cheapest effort level that can reliably complete it.
> 3. Preserve key decisions in a compact running memory.
> 4. If blocked, propose one high-value next action instead of exploring multiple branches.
That prompt does two things. First, it narrows the spend policy. Second, it prevents the classic agent move of reopening settled questions for no reason.
If you want more examples like this, the Rephrase blog covers practical prompting patterns for real workflows.
A budget-aware agent workflow separates planning from execution, spends more on the plan, and keeps a compact memory so the model does not repeatedly pay to rediscover the same facts. This approach lines up with benchmark findings on planning bottlenecks and memory-driven success.[2][3]
I like a three-phase pattern: plan expensively up front, execute each step cheaply, and escalate effort only during recovery when a step fails.
The "before → after" shift looks like this:
| Before | After |
|---|---|
| Every agent turn uses the same high-effort policy | Planning and recovery use high/xhigh; execution uses low/medium |
| No memory discipline | Agent stores compact decisions and constraints |
| Re-checks settled issues repeatedly | Reopens decisions only with new evidence |
| Spends heavily at the start and drifts | Preserves budget for hard branches later |
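The "after" column can be sketched as a single loop. Everything here is hypothetical scaffolding: `call_model` stands in for whatever client function you use and is not a real SDK call, and the effort strings follow the routing convention discussed above.

```python
# Hypothetical three-phase loop: plan expensively once, execute cheaply,
# escalate only on recovery. `call_model` is a stand-in, not a real SDK call.

def run_task(task, call_model, memory=None):
    memory = memory if memory is not None else []

    # Phase 1: plan with high effort, once.
    plan = call_model(task, effort="xhigh", context=memory)
    memory.append(("plan", plan))

    results = []
    for step in plan:
        # Phase 2: execute each step cheaply.
        result = call_model(step, effort="low", context=memory)
        if result is None:
            # Phase 3: recover with escalated effort only when blocked.
            result = call_model(step, effort="xhigh", context=memory)
        memory.append((step, result))  # compact memory: pay for facts once
        results.append(result)
    return results


# Toy stand-in model to show the control flow.
def fake_model(task, effort, context):
    if task == "build report":
        return ["fetch", "summarize"]
    if task == "fetch" and effort == "low":
        return None  # simulate a blocked cheap attempt
    return f"{task}:{effort}"

print(run_task("build report", fake_model))  # ['fetch:xhigh', 'summarize:low']
```

Note where the spend lands: one xhigh call for the plan, one more only for the step that actually failed, and low effort everywhere else.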
This is also where prompt rewriting tools are genuinely useful. If you write "be thorough but efficient," that's not enough. A tool like Rephrase can turn that fuzzy instruction into a specific budget policy with escalation rules.
When task budgets are enabled, watch for two failure modes: under-spending on critical planning and over-spending on repeated low-value loops. The goal is not just to reduce tokens. It's to spend them where they produce better outcomes per dollar.
I'd monitor four things in production: total tokens per completed task, tokens spent before first useful action, number of retries per branch, and success rate after escalation. Those tell you whether the budget policy is helping or just suppressing thought.
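The four signals above are cheap to compute from per-step logs. The log schema here is made up; adapt the field names to whatever your tracing setup records.

```python
# Sketch of computing the four monitoring signals from per-step logs.
# The log schema is illustrative, not from any particular tracing tool.

def budget_metrics(steps):
    """steps: dicts with 'tokens', 'useful', 'retry', 'escalated', 'success'."""
    total_tokens = sum(s["tokens"] for s in steps)

    # Tokens spent before the first useful action.
    tokens_before_first_useful = 0
    for s in steps:
        if s["useful"]:
            break
        tokens_before_first_useful += s["tokens"]

    retries = sum(1 for s in steps if s["retry"])

    # Success rate after escalation, or None if nothing escalated.
    escalated = [s for s in steps if s["escalated"]]
    success_after_escalation = (
        sum(1 for s in escalated if s["success"]) / len(escalated)
        if escalated else None
    )
    return {
        "total_tokens": total_tokens,
        "tokens_before_first_useful": tokens_before_first_useful,
        "retries": retries,
        "success_after_escalation": success_after_escalation,
    }


log = [
    {"tokens": 500,  "useful": False, "retry": False, "escalated": False, "success": True},
    {"tokens": 2000, "useful": True,  "retry": False, "escalated": False, "success": True},
    {"tokens": 4000, "useful": True,  "retry": True,  "escalated": True,  "success": True},
    {"tokens": 4000, "useful": True,  "retry": True,  "escalated": True,  "success": False},
]
print(budget_metrics(log))
```

A low success rate after escalation is the tell that your policy is suppressing thought rather than routing it.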
One more thing: benchmarks on long-running agents keep showing that memory quality matters. In YC-Bench, scratchpad usage was one of the strongest predictors of success because it prevented the model from relearning the same painful lesson over and over.[3] That means a budget policy without memory is incomplete.
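A compact memory with a blacklist is simple to sketch. This is my illustration of the idea, not code from YC-Bench: settle a decision once, refuse to reopen it, and never retry an approach that already failed.

```python
# Illustrative compact scratchpad: store decisions once and blacklist
# failed approaches so the agent doesn't pay to relearn the same lesson.

class Scratchpad:
    def __init__(self):
        self.decisions = {}     # settled choices, keyed by topic
        self.blacklist = set()  # approaches that already failed

    def settle(self, topic: str, choice: str) -> None:
        """Record a decision; ignore attempts to reopen settled topics."""
        self.decisions.setdefault(topic, choice)

    def mark_failed(self, approach: str) -> None:
        self.blacklist.add(approach)

    def should_try(self, approach: str) -> bool:
        return approach not in self.blacklist


pad = Scratchpad()
pad.settle("db", "sqlite")
pad.settle("db", "postgres")   # ignored: "db" is already settled
pad.mark_failed("regex-parse")
print(pad.decisions["db"])            # sqlite
print(pad.should_try("regex-parse"))  # False
```

Feeding a summary of `decisions` and `blacklist` back into each step's context is what keeps the budget policy from leaking through repetition.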
Try this the next time your Claude agent gets expensive: don't just lower the budget. Rewrite the spend policy. Tell the agent exactly what deserves xhigh effort and what absolutely does not. That's the difference between "cheaper" and "actually efficient."
Documentation & Research

Community Examples

- End to end Task Orchestration with Claude AI (Free Plan) - r/PromptEngineering (link)
**What is xhigh effort?**
xhigh is a reasoning effort level between high and max. It gives you a finer tradeoff between quality, latency, and cost on hard tasks, especially in coding and agent workflows.
**Should I use xhigh for every call?**
No. xhigh is best for difficult steps where better planning changes the outcome. For routine retrieval, formatting, or simple tool calls, lower effort is usually cheaper and fast enough.