Learn how to design minimal tool sets for AI agents that stay accurate, reduce tool confusion, and scale better in production.
Most agent failures don't come from weak models. They come from bad tool design. Give an agent too many overlapping tools, and it starts guessing instead of reasoning.
A minimal viable tool set is the smallest set of tools that lets an agent complete its core job reliably. The point is not elegance for its own sake. The point is reducing ambiguity so the model spends less time choosing between tools and more time solving the actual task. [1][2]
Here's my rule: design tools the way you'd design a small API for a junior engineer on day one. If they can understand what to use without asking follow-up questions, you're close.
In the agent literature, tool use works best when reasoning and acting are tightly coupled, but the context provided to the model stays concise and well scoped. The recent survey Agentic Reasoning for Large Language Models makes this explicit: large or complex toolsets degrade performance, while well-written tool documentation and orchestration improve zero-shot use. [1] That's the academic version of a very practical lesson: clutter breaks agents.
A second paper pushes the point even harder. Act Wisely describes "blind tool invocation," where agents call tools reflexively even when the answer is already available from context. The result is more latency, more noise, and sometimes worse accuracy. [2] In other words, more tools do not just cost more. They can make the model dumber in practice.
Large tool sets confuse the model because tool selection becomes a reasoning problem on top of the original task. Every extra tool adds description tokens, overlap, edge cases, and possible failure paths, which increases the odds of wrong or unnecessary calls. [1][2]
This is where many teams get ambitious too early. They build a "universal" agent with search, CRM actions, project management, email, browser control, database access, and internal APIs. Then they wonder why it keeps choosing the wrong function or asking odd follow-up questions.
The Google Cloud guide to production-ready agents emphasizes that agents need deliberate orchestration, testing, and system design rather than raw capability stuffing. [3] That matches what the Viktor team described in a production write-up: dumping hundreds of tool schemas into context made the model slower and more confused, while one-line skill summaries plus lazy loading worked better in practice. [4] That's a community source, so I wouldn't treat it as gospel, but it's a useful real-world example of the same principle.
The catch is that confusion is rarely caused by tool count alone. It usually comes from three design mistakes happening together: overlapping tools, vague descriptions, and no gating.
You choose the first tools by starting from one job, not from all available integrations. Define the agent's core task, identify the smallest external actions required, and remove any tool that is nice to have but not necessary for first-pass success. [1][3]
I like to use a simple numbered process here:

1. Define the agent's core job in one sentence.
2. List the smallest set of external actions that job requires.
3. Cut every tool that is nice to have but not necessary for first-pass success.
4. Give the agent read tools before write tools, so it can inspect state before it changes state.

That last point matters more than people think. If an agent can inspect state before it changes state, it makes fewer irreversible mistakes.
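The read-before-write idea can also be enforced in code, not just in the prompt. Here's a minimal sketch where the write tool refuses to act until the matching read has happened; all class and field names are hypothetical, not part of any real framework:

```python
# Sketch of "inspect before you mutate": the write tool returns a recovery
# hint instead of acting if the agent never read the order's state first.
class ToolGate:
    def __init__(self):
        self.inspected: set[str] = set()

    def get_order_status(self, order_id: str) -> dict:
        # Read tool: record that this order's state has been inspected.
        self.inspected.add(order_id)
        return {"order_id": order_id, "refund_eligible": True}  # stubbed data

    def create_refund_request(self, order_id: str, reason: str) -> dict:
        # Write tool: gated on a prior read of the same order.
        if order_id not in self.inspected:
            return {"error": "call get_order_status first", "request_id": None}
        return {"request_id": "req_1", "status": "pending"}
```

The error message doubles as a recovery hint the model can follow on its next turn, which is exactly the kind of explicit failure path the table below recommends.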
For a support agent, the minimal tool set might be just three tools: get_customer_by_email, get_order_status, and create_refund_request. Not twelve tools. Not "search_database." Not separate variants for every table. The survey on agentic reasoning specifically notes that explicit names and clear schemas improve reliable use. A tool called search_customer_orders_by_email is better than search_database because it tells the model exactly what job it does. [1]
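Written out as schemas, that three-tool set might look like the following. The field layout follows the common JSON-Schema convention for tool parameters; treat the exact descriptions and fields as illustrative, not a real API:

```python
# Hypothetical schemas for the three-tool refund support agent.
TOOLS = [
    {
        "name": "get_customer_by_email",
        "description": "Look up a customer record by email address.",
        "parameters": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
    {
        "name": "get_order_status",
        "description": "Fetch status and refund eligibility for one order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "create_refund_request",
        "description": "Open a refund request for an eligible order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "reason": {"type": "string"},
            },
            "required": ["order_id", "reason"],
        },
    },
]
```

Each name states the tool's one job, and every input is a typed, required field, so the model never has to guess what to pass.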
A clear tool has a narrow purpose, explicit inputs, and structured outputs that are easy for the model to reason over. Clarity beats flexibility because language models are better at choosing between distinct options than interpreting broad, multifunction tools. [1]
Here's what I've noticed: teams often design tools for backend reuse, not model usability. Those are different goals.
| Tool design choice | Confusing version | Better version |
|---|---|---|
| Name | search_database | find_invoice_by_customer_id |
| Scope | Does many unrelated queries | Does one job well |
| Input schema | Free-form text | Typed fields with constraints |
| Output | Paragraph summary | JSON with fixed keys |
| Error handling | Generic failure text | Explicit error code and recovery hint |
That JSON point matters. The survey explicitly recommends structured outputs rather than prose because the model can parse and reuse them more reliably in later steps. [1]
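Here's the difference in miniature. A fixed-key JSON return lets the agent, or the glue code around it, branch deterministically on a field like `refund_eligible` instead of re-reading a paragraph; the stubbed record below is illustrative:

```python
import json

def get_order_status(order_id: str) -> str:
    # Sketch only: a real implementation would query an order service.
    record = {
        "order_status": "delivered",
        "delivered_at": "2024-05-01",
        "refund_eligible": True,
    }
    # Fixed keys, not a prose summary like "The order arrived in May..."
    return json.dumps(record)

result = json.loads(get_order_status("ord_123"))
if not result["refund_eligible"]:
    # Deterministic branch on a named field, no text interpretation needed.
    pass
```

A prose summary forces the model to parse natural language in every later step; a fixed schema makes each downstream decision a lookup.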
Here's a before-and-after prompt design example.
Before:
```
You can use any of these 14 company tools to help the user. Choose the best one and solve the request.
```
After:
```
You are a refund support agent.

Available tools:
1. get_customer_by_email(email) -> {customer_id, account_status}
2. get_order_status(order_id) -> {order_status, delivered_at, refund_eligible}
3. create_refund_request(order_id, reason) -> {request_id, status}

Use tools only when needed.
If refund_eligible is false, explain why and do not create a request.
Return a short final answer for the customer.
```
Same model. Much better odds.
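To make this concrete, here is one way the "after" prompt could be wired into a chat-completions request. The payload shape follows the widely used function-calling convention; the exact field names and the model string are assumptions about your provider's API, so check them against your own docs:

```python
# Assumed OpenAI-style function-calling payload; verify field names
# against your provider before relying on this shape.
request = {
    "model": "gpt-4o",  # placeholder model name
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a refund support agent. Use tools only when needed. "
                "If refund_eligible is false, explain why and do not create "
                "a request. Return a short final answer for the customer."
            ),
        },
        {"role": "user", "content": "I want a refund for order ord_123."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_order_status",
                "description": "Fetch status and refund eligibility for one order.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            },
        },
        # get_customer_by_email and create_refund_request follow the same shape
    ],
}
```

Notice the gating rule lives in the system message while the schemas stay narrow; the model sees three distinct options and one explicit boundary, not fourteen overlapping ones.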
If you want help rewriting messy tool instructions or agent prompts across apps, Rephrase is useful for tightening vague text before it ever reaches your model.
Several specialized agents usually outperform one general agent when the jobs, context, or tools are meaningfully different. Smaller tool sets lower cognitive load and make behavior easier to test, debug, and trust in production. [2][3][4]
This is one of those cases where product simplicity and model performance align. A travel planner, a calendar assistant, and a support triage agent should not all share the same prompt and tool inventory unless there is a strong reason.
The Act Wisely paper frames this as a meta-cognitive problem: the model must learn when not to use a tool. [2] That gets much harder when the environment is crowded. The Google guidance on production agents also leans toward intentional architecture and lifecycle design rather than all-in-one setups. [3] And the Viktor article gives a practical example: instead of loading everything, they expose compact "skills" and expand details only when needed. [4]
That pattern is worth stealing even if your stack is totally different. Keep the hot path small. Load detail on demand.
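The hot-path pattern can be sketched in a few lines. Only one-line skill summaries go into every request; full tool schemas load after the model commits to a skill. All names and the registry contents below are hypothetical, loosely modeled on the approach described in the Viktor write-up [4]:

```python
# Compact summaries in the hot path, full schemas loaded on demand.
SKILL_SUMMARIES = {
    "refunds": "Look up orders and open refund requests.",
    "billing": "Read invoices and payment history.",
    "shipping": "Track parcels and delivery estimates.",
}

def build_hot_path_prompt() -> str:
    # One line per skill is all the model sees on every turn.
    lines = [f"- {name}: {desc}" for name, desc in SKILL_SUMMARIES.items()]
    return "Available skills:\n" + "\n".join(lines)

def load_skill(name: str) -> list:
    # Full tool schemas are fetched only after the model picks a skill.
    registry = {
        "refunds": [
            {"name": "get_order_status"},
            {"name": "create_refund_request"},
        ],
    }
    return registry.get(name, [])
```

The context cost per turn stays proportional to the number of skills, not the number of tools, which is the whole point of lazy loading.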
For more articles on writing tighter AI instructions and workflows, the Rephrase blog is worth bookmarking.
Expand a tool set only after you can prove a missing capability in logs or evaluations. Add one tool at a time, test whether it reduces failure on a specific task, and keep the rest of the prompt stable so you can see what changed. [1][3]
This is where discipline matters. Don't add tools because a stakeholder says, "We might need this later." Add them because the agent is repeatedly failing a real task that the new tool would solve.
My preferred expansion order is simple. First, improve descriptions. Second, merge or remove overlap. Third, add a new tool only if those fixes don't solve the issue. In a lot of teams, "we need more tools" is really "our current tools are badly named."
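The "add one tool, measure, keep the rest stable" step can be as simple as comparing failure rates on a fixed task set. The harness below is a sketch; `run_agent`, the task list, and the tool lists are all assumptions standing in for your own eval setup:

```python
# Minimal eval harness: same tasks, same prompt, one tool changed at a time.
def failure_rate(run_agent, tools: list, tasks: list) -> float:
    # run_agent(task, tools) -> True on success, False on failure.
    failures = sum(0 if run_agent(task, tools) else 1 for task in tasks)
    return failures / len(tasks)

# baseline  = failure_rate(run_agent, CURRENT_TOOLS, EVAL_TASKS)
# candidate = failure_rate(run_agent, CURRENT_TOOLS + [new_tool], EVAL_TASKS)
# Keep the new tool only if `candidate` is clearly lower on the tasks
# the tool was supposed to fix, without regressing the others.
```

Because everything else is held constant, any change in the failure rate is attributable to the one tool you added.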
And yes, prompt hygiene matters here too. If you're iterating on agent instructions all day in Slack, your IDE, or docs, Rephrase's homepage shows the kind of shortcut-based workflow that makes this less painful.
The big idea is boring, which is why it works: give agents fewer choices, better descriptions, and clearer boundaries. Minimal viable tool sets don't limit intelligence. They remove noise so the model can use its intelligence where it actually matters.
Documentation & Research
1. Agentic Reasoning for Large Language Models - survey paper
2. Act Wisely - paper on blind tool invocation
3. Production-ready AI agents - Google Cloud guide

Community Examples
4. What Breaks When Your Agent Has 100,000 Tools - Viktor blog / Hacker News (link)
How many tools should an agent have?
There is no universal number, but fewer is usually better at the start. Give an agent only the tools required to complete its core job, then add more only when logs show a real gap.
What makes a good tool description?
A good tool description is specific, narrow, and unambiguous. It should clearly say when to use the tool, what inputs it expects, and what output shape it returns.