# Your agent just called the same tool 20 times
Watch an agent work through a complex task and you'll see it. The same database lookup, three times in a row. The same search query, once per reasoning iteration. The same API call, identical arguments, identical results - billed fresh every single time.
This isn't a bug. It's how reasoning loops work. The model doesn't remember that it already called `get_project` with `{"key": "alpha"}` two turns ago. So it calls it again. And again. And your external API meter keeps ticking.
## The math is brutal
A reasoning agent working through a multi-step problem makes 20-50 tool calls per task. In our benchmarks, 80% of those calls are duplicates - same tool, same arguments, same result. That's 16 out of 20 calls that accomplish nothing except burning latency and API quota.
Each call hits your MCP server, which hits your database or external API, which takes 50-200ms of wall time. Multiply by the number of agents, the number of tasks per hour, and you're looking at thousands of wasted calls per day.
The worst part: the model gets identical text back every time. It doesn't know or care whether the result came from a fresh API call or from memory.
## The fix: three lines of YAML
```yaml
# researcher-agent.yaml
tools:
  mcp:
    - name: search
      url: http://search-mcp.tools.svc:8080
      cache:
        enabled: true
        ttlSeconds: 600
        excludeTools:
          - create_bookmark
```
That's it. Every tool on the search MCP server now caches results for 10 minutes. Same arguments, same tool name = cache hit. The agent gets the exact same response in nanoseconds instead of milliseconds. The MCP server never sees the duplicate call.
`excludeTools` is the safety valve. Tools that create, update, or delete - anything non-idempotent - go on this list and always hit the real server.
## What actually happens
Here's a real benchmark. 20 tool calls, 4 unique argument combinations, 50ms simulated MCP latency:
| Metric | Without cache | With cache |
|---|---|---|
| MCP server hits | 20 | 4 |
| Total latency | 1,029ms | 205ms |
| Avg per call | 51ms | 10ms |
| Savings | - | 80% fewer calls, 80% faster |
On a cache hit, the response returns in 42 nanoseconds. Not milliseconds. Nanoseconds. The model sees identical output:
```json
{
  "key": "project-alpha",
  "value": { "budget": 50000, "status": "active" },
  "created_at": "2026-04-27T10:27:55Z"
}
```
Same string, whether it came from the MCP server or from memory. The LLM cannot tell the difference.
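Want to sanity-check the shape of those numbers? Here's a toy reproduction in Go - a hypothetical harness, not the project's actual benchmark code - cycling 20 calls through 4 unique argument sets against a fake 50ms server:

```go
// Toy reproduction of the benchmark shape above (assumed setup):
// 20 calls, 4 unique argument combinations, 50ms simulated MCP latency.
package main

import (
	"fmt"
	"time"
)

// fakeMCP stands in for a real MCP server round trip.
func fakeMCP(args string) string {
	time.Sleep(50 * time.Millisecond) // simulated server latency
	return "result for " + args
}

func run(cached bool) (serverHits int, elapsed time.Duration) {
	cache := map[string]string{}
	start := time.Now()
	for i := 0; i < 20; i++ {
		args := fmt.Sprintf(`{"key":"project-%d"}`, i%4) // 4 unique combos
		if cached {
			if _, ok := cache[args]; ok {
				continue // cache hit: no server round trip
			}
		}
		result := fakeMCP(args)
		serverHits++
		if cached {
			cache[args] = result
		}
	}
	return serverHits, time.Since(start)
}

func main() {
	h, d := run(false)
	fmt.Printf("without cache: %d server hits, %v\n", h, d) // 20 hits, ~1000ms
	h, d = run(true)
	fmt.Printf("with cache:    %d server hits, %v\n", h, d) // 4 hits, ~200ms
}
```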
## How it works under the hood
The cache key is `sha256(tool_name + canonicalized_json_args)`. Arguments are re-serialized before hashing, so `{"a":1, "b":2}` and `{"b":2,"a":1}` hit the same cache entry. Whitespace differences don't matter.
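A minimal sketch of that keying scheme in Go (the `cacheKey` helper is a name I made up, not the project's actual code): round-trip the arguments through `encoding/json`, which sorts map keys when marshaling, then hash the tool name plus the canonical bytes.

```go
// Sketch of the cache-key scheme described above (assumed names).
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

func cacheKey(toolName string, rawArgs []byte) (string, error) {
	// Unmarshal into a map, then re-marshal: encoding/json sorts map keys
	// and strips whitespace, so key order and spacing don't matter.
	var args map[string]any
	if err := json.Unmarshal(rawArgs, &args); err != nil {
		return "", err
	}
	canonical, err := json.Marshal(args)
	if err != nil {
		return "", err
	}
	h := sha256.Sum256(append([]byte(toolName), canonical...))
	return hex.EncodeToString(h[:]), nil
}

func main() {
	k1, _ := cacheKey("get_project", []byte(`{"a":1, "b":2}`))
	k2, _ := cacheKey("get_project", []byte(`{"b":2,"a":1}`))
	fmt.Println(k1 == k2) // true: same tool, same canonical args, same entry
}
```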
The cache lives in-memory inside the agent pod. No Redis, no external dependencies, no infrastructure to deploy. It works with any LLM provider - Anthropic, OpenAI, Ollama, vLLM, whatever you run. The cache sits between the runner and the MCP dispatch, completely provider-agnostic.
Cache entries expire after `ttlSeconds`. Default is 300 (5 minutes) - long enough to cover a reasoning loop, short enough that stale data isn't a real risk.
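The store itself doesn't need to be fancy. Something like this map-plus-mutex sketch covers it (an assumed structure - the real implementation may differ):

```go
// Minimal in-memory TTL cache sketch. Entries expire ttl after insertion;
// expired entries are evicted lazily on the next lookup.
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

type TTLCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func New(ttl time.Duration) *TTLCache {
	return &TTLCache{ttl: ttl, m: make(map[string]entry)}
}

func (c *TTLCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok || time.Now().After(e.expiresAt) {
		delete(c.m, key) // lazy eviction of expired entries
		return "", false
	}
	return e.value, true
}

func (c *TTLCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}
```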
## What doesn't get cached
Three things bypass the cache unconditionally:
- Tools on the exclude list. `store_result`, `send_email`, `create_ticket` - anything that changes state.
- Failed calls. If the MCP server returns an error, the error is not cached. The next call retries fresh.
- Servers without `cache.enabled: true`. Default is off. You opt in per server, not globally.
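Put together, the dispatch path looks roughly like this (names and control flow are assumptions sketched from the behavior described above, not the actual source):

```go
// Sketch of the bypass logic: only successful results from cache-enabled,
// non-excluded tools are stored; everything else hits the real server.
package dispatch

import "slices"

type CacheConfig struct {
	Enabled      bool
	ExcludeTools []string
}

type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

func callWithCache(
	cfg CacheConfig,
	cache Cache,
	tool string,
	key string, // sha256(tool + canonical args), as described above
	callMCP func() (string, error),
) (string, error) {
	cacheable := cfg.Enabled && !slices.Contains(cfg.ExcludeTools, tool)
	if cacheable {
		if v, ok := cache.Get(key); ok {
			return v, nil // hit: the MCP server never sees the call
		}
	}
	result, err := callMCP()
	if err != nil {
		return "", err // errors are never cached; the next call retries fresh
	}
	if cacheable {
		cache.Set(key, result)
	}
	return result, nil
}
```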
## Progressive example
Start simple - cache everything on one server:
```yaml
tools:
  mcp:
    - name: knowledge-base
      url: http://kb-mcp.tools.svc:8080
      cache:
        enabled: true
```
Defaults kick in: 300s TTL, no excludes. Every tool on that server gets cached.
Get specific - different TTLs per server, exclude writes:
```yaml
tools:
  mcp:
    - name: search
      url: http://search.tools.svc:8080
      cache:
        enabled: true
        ttlSeconds: 600
    - name: crm
      url: http://crm-mcp.tools.svc:8080
      cache:
        enabled: true
        ttlSeconds: 60
        excludeTools:
          - update_contact
          - create_deal
          - send_message
```
Search results are stable for 10 minutes. CRM lookups get a shorter window because the data changes more often. Write operations always go through.
## When not to use it
If your tools return different results for the same input (random sampling, current timestamp, live sensor data), caching will serve stale values. Don't enable it for those servers.
If your reasoning loops genuinely need fresh data on every iteration (polling for completion, watching for state changes), the TTL needs to be shorter than your poll interval - or just leave caching off for that server.
## The takeaway
80% of your agent's tool calls are waste. Three lines of YAML eliminate them. The model gets identical results, your MCP servers get 80% less traffic, and your agent finishes faster.
Try it: add `cache.enabled: true` to one MCP server and watch your tool call metrics drop.