Learn how to design minimal tool sets for AI agents that stay accurate, reduce tool confusion, and scale better in production.
Most agent failures don't come from weak models. They come from bad tool design. Give an agent too many overlapping tools, and it starts guessing instead of reasoning.
A minimal viable tool set is the smallest set of tools that lets an agent complete its core job reliably. The point is not elegance for its own sake. The point is reducing ambiguity so the model spends less time choosing between tools and more time solving the actual task. [1][2]
Here's my rule: design tools the way you'd design a small API for a junior engineer on day one. If they can understand what to use without asking follow-up questions, you're close.
In the agent literature, tool use works best when reasoning and acting are tightly coupled, but the context provided to the model stays concise and well scoped. The recent survey Agentic Reasoning for Large Language Models makes this explicit: large or complex toolsets degrade performance, while well-written tool documentation and orchestration improve zero-shot use. [1] That's the academic version of a very practical lesson: clutter breaks agents.
A second paper pushes the point even harder. Act Wisely describes "blind tool invocation," where agents call tools reflexively even when the answer is already available from context. The result is more latency, more noise, and sometimes worse accuracy. [2] In other words, more tools do not just cost more. They can make the model dumber in practice.
Large tool sets confuse the model because tool selection becomes a reasoning problem on top of the original task. Every extra tool adds description tokens, overlap, edge cases, and possible failure paths, which increases the odds of wrong or unnecessary calls. [1][2]
This is where many teams get ambitious too early. They build a "universal" agent with search, CRM actions, project management, email, browser control, database access, and internal APIs. Then they wonder why it keeps choosing the wrong function or asking odd follow-up questions.
The Google Cloud guide to production-ready agents emphasizes that agents need deliberate orchestration, testing, and system design rather than raw capability stuffing. [3] That matches what the Viktor team described in a production write-up: dumping hundreds of tool schemas into context made the model slower and more confused, while one-line skill summaries plus lazy loading worked better in practice. [4] That's a community source, so I wouldn't treat it as gospel, but it's a useful real-world example of the same principle.
The catch is that confusion is rarely caused by tool count alone. It usually comes from three design mistakes happening together: overlapping tools, vague descriptions, and no gating.
You choose the first tools by starting from one job, not from all available integrations. Define the agent's core task, identify the smallest external actions required, and remove any tool that is nice to have but not necessary for first-pass success. [1][3]
I like to use a simple numbered process here:

1. Define the agent's core job in one sentence.
2. List the smallest set of external actions that job requires.
3. Cut every tool that is nice to have but not necessary for first-pass success.
4. Give the agent read tools before write tools, so it can inspect state before it changes state.

That last point matters more than people think. If an agent can inspect state before it changes state, it makes fewer irreversible mistakes.
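The read-before-write idea can also be enforced in code, not just in the prompt. Here's a minimal sketch where the write tool refuses to act until the matching read has happened; all class and field names are hypothetical, not part of any real framework:

```python
# Sketch of "inspect before you mutate": the write tool returns a recovery
# hint instead of acting if the agent never read the order's state first.
class ToolGate:
    def __init__(self):
        self.inspected: set[str] = set()

    def get_order_status(self, order_id: str) -> dict:
        # Read tool: record that this order's state has been inspected.
        self.inspected.add(order_id)
        return {"order_id": order_id, "refund_eligible": True}  # stubbed data

    def create_refund_request(self, order_id: str, reason: str) -> dict:
        # Write tool: gated on a prior read of the same order.
        if order_id not in self.inspected:
            return {"error": "call get_order_status first", "request_id": None}
        return {"request_id": "req_1", "status": "pending"}
```

The error message doubles as a recovery hint the model can follow on its next turn, which is exactly the kind of explicit failure path the table below recommends.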
For a support agent, the minimal tool set might be just three tools: get_customer_by_email, get_order_status, and create_refund_request. Not twelve tools. Not "search_database." Not separate variants for every table. The survey on agentic reasoning specifically notes that explicit names and clear schemas improve reliable use. A tool called search_customer_orders_by_email is better than search_database because it tells the model exactly what job it does. [1]
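Written out as schemas, that three-tool set might look like the following. The field layout follows the common JSON-Schema convention for tool parameters; treat the exact descriptions and fields as illustrative, not a real API:

```python
# Hypothetical schemas for the three-tool refund support agent.
TOOLS = [
    {
        "name": "get_customer_by_email",
        "description": "Look up a customer record by email address.",
        "parameters": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
    {
        "name": "get_order_status",
        "description": "Fetch status and refund eligibility for one order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "create_refund_request",
        "description": "Open a refund request for an eligible order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "reason": {"type": "string"},
            },
            "required": ["order_id", "reason"],
        },
    },
]
```

Each name states the tool's one job, and every input is a typed, required field, so the model never has to guess what to pass.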
A clear tool has a narrow purpose, explicit inputs, and structured outputs that are easy for the model to reason over. Clarity beats flexibility because language models are better at choosing between distinct options than interpreting broad, multifunction tools. [1]
Here's what I've noticed: teams often design tools for backend reuse, not model usability. Those are different goals.
| Tool design choice | Confusing version | Better version |
|---|---|---|
| Name | search_database | find_invoice_by_customer_id |
| Scope | Does many unrelated queries | Does one job well |
| Input schema | Free-form text | Typed fields with constraints |
| Output | Paragraph summary | JSON with fixed keys |
| Error handling | Generic failure text | Explicit error code and recovery hint |
That JSON point matters. The survey explicitly recommends structured outputs rather than prose because the model can parse and reuse them more reliably in later steps. [1]
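Here's the difference in miniature. A fixed-key JSON return lets the agent, or the glue code around it, branch deterministically on a field like `refund_eligible` instead of re-reading a paragraph; the stubbed record below is illustrative:

```python
import json

def get_order_status(order_id: str) -> str:
    # Sketch only: a real implementation would query an order service.
    record = {
        "order_status": "delivered",
        "delivered_at": "2024-05-01",
        "refund_eligible": True,
    }
    # Fixed keys, not a prose summary like "The order arrived in May..."
    return json.dumps(record)

result = json.loads(get_order_status("ord_123"))
if not result["refund_eligible"]:
    # Deterministic branch on a named field, no text interpretation needed.
    pass
```

A prose summary forces the model to parse natural language in every later step; a fixed schema makes each downstream decision a lookup.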
Here's a before-and-after prompt design example.
Before:
```
You can use any of these 14 company tools to help the user. Choose the best one and solve the request.
```
After:
```
You are a refund support agent.

Available tools:
1. get_customer_by_email(email) -> {customer_id, account_status}
2. get_order_status(order_id) -> {order_status, delivered_at, refund_eligible}
3. create_refund_request(order_id, reason) -> {request_id, status}

Use tools only when needed.
If refund_eligible is false, explain why and do not create a request.
Return a short final answer for the customer.
```
Same model. Much better odds.
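To make this concrete, here is one way the "after" prompt could be wired into a chat-completions request. The payload shape follows the widely used function-calling convention; the exact field names and the model string are assumptions about your provider's API, so check them against your own docs:

```python
# Assumed OpenAI-style function-calling payload; verify field names
# against your provider before relying on this shape.
request = {
    "model": "gpt-4o",  # placeholder model name
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a refund support agent. Use tools only when needed. "
                "If refund_eligible is false, explain why and do not create "
                "a request. Return a short final answer for the customer."
            ),
        },
        {"role": "user", "content": "I want a refund for order ord_123."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_order_status",
                "description": "Fetch status and refund eligibility for one order.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            },
        },
        # get_customer_by_email and create_refund_request follow the same shape
    ],
}
```

Notice the gating rule lives in the system message while the schemas stay narrow; the model sees three distinct options and one explicit boundary, not fourteen overlapping ones.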
If you want help rewriting messy tool instructions or agent prompts across apps, Rephrase is useful for tightening vague text before it ever reaches your model.
Several specialized agents usually outperform one general agent when the jobs, context, or tools are meaningfully different. Smaller tool sets lower cognitive load and make behavior easier to test, debug, and trust in production. [2][3][4]
This is one of those cases where product simplicity and model performance align. A travel planner, a calendar assistant, and a support triage agent should not all share the same prompt and tool inventory unless there is a strong reason.
The Act Wisely paper frames this as a meta-cognitive problem: the model must learn when not to use a tool. [2] That gets much harder when the environment is crowded. The Google guidance on production agents also leans toward intentional architecture and lifecycle design rather than all-in-one setups. [3] And the Viktor article gives a practical example: instead of loading everything, they expose compact "skills" and expand details only when needed. [4]
That pattern is worth stealing even if your stack is totally different. Keep the hot path small. Load detail on demand.
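The hot-path pattern can be sketched in a few lines. Only one-line skill summaries go into every request; full tool schemas load after the model commits to a skill. All names and the registry contents below are hypothetical, loosely modeled on the approach described in the Viktor write-up [4]:

```python
# Compact summaries in the hot path, full schemas loaded on demand.
SKILL_SUMMARIES = {
    "refunds": "Look up orders and open refund requests.",
    "billing": "Read invoices and payment history.",
    "shipping": "Track parcels and delivery estimates.",
}

def build_hot_path_prompt() -> str:
    # One line per skill is all the model sees on every turn.
    lines = [f"- {name}: {desc}" for name, desc in SKILL_SUMMARIES.items()]
    return "Available skills:\n" + "\n".join(lines)

def load_skill(name: str) -> list:
    # Full tool schemas are fetched only after the model picks a skill.
    registry = {
        "refunds": [
            {"name": "get_order_status"},
            {"name": "create_refund_request"},
        ],
    }
    return registry.get(name, [])
```

The context cost per turn stays proportional to the number of skills, not the number of tools, which is the whole point of lazy loading.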
For more articles on writing tighter AI instructions and workflows, the Rephrase blog is worth bookmarking.
Expand a tool set only after you can prove a missing capability in logs or evaluations. Add one tool at a time, test whether it reduces failure on a specific task, and keep the rest of the prompt stable so you can see what changed. [1][3]
This is where discipline matters. Don't add tools because a stakeholder says, "We might need this later." Add them because the agent is repeatedly failing a real task that the new tool would solve.
My preferred expansion order is simple. First, improve descriptions. Second, merge or remove overlap. Third, add a new tool only if those fixes don't solve the issue. In a lot of teams, "we need more tools" is really "our current tools are badly named."
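The "add one tool, measure, keep the rest stable" step can be as simple as comparing failure rates on a fixed task set. The harness below is a sketch; `run_agent`, the task list, and the tool lists are all assumptions standing in for your own eval setup:

```python
# Minimal eval harness: same tasks, same prompt, one tool changed at a time.
def failure_rate(run_agent, tools: list, tasks: list) -> float:
    # run_agent(task, tools) -> True on success, False on failure.
    failures = sum(0 if run_agent(task, tools) else 1 for task in tasks)
    return failures / len(tasks)

# baseline  = failure_rate(run_agent, CURRENT_TOOLS, EVAL_TASKS)
# candidate = failure_rate(run_agent, CURRENT_TOOLS + [new_tool], EVAL_TASKS)
# Keep the new tool only if `candidate` is clearly lower on the tasks
# the tool was supposed to fix, without regressing the others.
```

Because everything else is held constant, any change in the failure rate is attributable to the one tool you added.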
And yes, prompt hygiene matters here too. If you're iterating on agent instructions all day in Slack, your IDE, or docs, Rephrase's homepage shows the kind of shortcut-based workflow that makes this less painful.
The big idea is boring, which is why it works: give agents fewer choices, better descriptions, and clearer boundaries. Minimal viable tool sets don't limit intelligence. They remove noise so the model can use its intelligence where it actually matters.
Documentation & Research
1. Agentic Reasoning for Large Language Models - survey paper
2. Act Wisely - paper on blind tool invocation
3. Production-ready AI agents - Google Cloud guide

Community Examples
4. What Breaks When Your Agent Has 100,000 Tools - Viktor blog / Hacker News (link)
How many tools should an agent have?
There is no universal number, but fewer is usually better at the start. Give an agent only the tools required to complete its core job, then add more only when logs show a real gap.
What makes a good tool description?
A good tool description is specific, narrow, and unambiguous. It should clearly say when to use the tool, what inputs it expects, and what output shape it returns.