# Your agent just called the same tool 20 times
Watch an agent work through a complex task and you'll see it. The same database lookup, three times in a row. The same search query, once per reasoning iteration. The same API call, identical arguments, identical results - billed fresh every single time.
This isn't a bug. It's how reasoning loops work. The model doesn't remember that it already called `get_project` with `{"key": "alpha"}` two turns ago. So it calls it again. And again. And your external API meter keeps ticking.
## The math is brutal
A reasoning agent working through a multi-step problem makes 20-50 tool calls per task. In our benchmarks, 80% of those calls are duplicates - same tool, same arguments, same result. That's 16 out of 20 calls that accomplish nothing except burning latency and API quota.
Each call hits your MCP server, which hits your database or external API, which takes 50-200ms of wall time. Multiply by the number of agents, the number of tasks per hour, and you're looking at thousands of wasted calls per day.
The worst part: the model gets identical text back every time. It doesn't know or care whether the result came from a fresh API call or from memory.
## The fix: three lines of YAML
```yaml
# researcher-agent.yaml
tools:
  mcp:
    - name: search
      url: http://search-mcp.tools.svc:8080
      cache:
        enabled: true
        ttlSeconds: 600
        excludeTools:
          - create_bookmark
```
That's it. Every tool on the search MCP server now caches results for 10 minutes. Same arguments, same tool name = cache hit. The agent gets the exact same response in nanoseconds instead of milliseconds. The MCP server never sees the duplicate call.
`excludeTools` is the safety valve. Tools that create, update, or delete - anything non-idempotent - go on this list and always hit the real server.
## What actually happens
Here's a real benchmark. 20 tool calls, 4 unique argument combinations, 50ms simulated MCP latency:
| Metric | Without cache | With cache |
|---|---|---|
| MCP server hits | 20 | 4 |
| Total latency | 1,029ms | 205ms |
| Avg per call | 51ms | 10ms |
| Savings | - | 80% fewer calls, 80% faster |
On a cache hit, the response returns in 42 nanoseconds. Not milliseconds. Nanoseconds. The model sees identical output:
```json
{
  "key": "project-alpha",
  "value": { "budget": 50000, "status": "active" },
  "created_at": "2026-04-27T10:27:55Z"
}
```
Same string, whether it came from the MCP server or from memory. The LLM cannot tell the difference.
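Want to sanity-check the shape of those numbers? Here's a toy reproduction in Go - a hypothetical harness, not the project's actual benchmark code - cycling 20 calls through 4 unique argument sets against a fake 50ms server:

```go
// Toy reproduction of the benchmark shape above (assumed setup):
// 20 calls, 4 unique argument combinations, 50ms simulated MCP latency.
package main

import (
	"fmt"
	"time"
)

// fakeMCP stands in for a real MCP server round trip.
func fakeMCP(args string) string {
	time.Sleep(50 * time.Millisecond) // simulated server latency
	return "result for " + args
}

func run(cached bool) (serverHits int, elapsed time.Duration) {
	cache := map[string]string{}
	start := time.Now()
	for i := 0; i < 20; i++ {
		args := fmt.Sprintf(`{"key":"project-%d"}`, i%4) // 4 unique combos
		if cached {
			if _, ok := cache[args]; ok {
				continue // cache hit: no server round trip
			}
		}
		result := fakeMCP(args)
		serverHits++
		if cached {
			cache[args] = result
		}
	}
	return serverHits, time.Since(start)
}

func main() {
	h, d := run(false)
	fmt.Printf("without cache: %d server hits, %v\n", h, d) // 20 hits, ~1000ms
	h, d = run(true)
	fmt.Printf("with cache:    %d server hits, %v\n", h, d) // 4 hits, ~200ms
}
```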
## How it works under the hood
The cache key is `sha256(tool_name + canonicalized_json_args)`. Arguments are re-serialized before hashing, so `{"a":1, "b":2}` and `{"b":2,"a":1}` hit the same cache entry. Whitespace differences don't matter.
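A minimal sketch of that keying scheme in Go (the `cacheKey` helper is a name I made up, not the project's actual code): round-trip the arguments through `encoding/json`, which sorts map keys when marshaling, then hash the tool name plus the canonical bytes.

```go
// Sketch of the cache-key scheme described above (assumed names).
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

func cacheKey(toolName string, rawArgs []byte) (string, error) {
	// Unmarshal into a map, then re-marshal: encoding/json sorts map keys
	// and strips whitespace, so key order and spacing don't matter.
	var args map[string]any
	if err := json.Unmarshal(rawArgs, &args); err != nil {
		return "", err
	}
	canonical, err := json.Marshal(args)
	if err != nil {
		return "", err
	}
	h := sha256.Sum256(append([]byte(toolName), canonical...))
	return hex.EncodeToString(h[:]), nil
}

func main() {
	k1, _ := cacheKey("get_project", []byte(`{"a":1, "b":2}`))
	k2, _ := cacheKey("get_project", []byte(`{"b":2,"a":1}`))
	fmt.Println(k1 == k2) // true: same tool, same canonical args, same entry
}
```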
The cache lives in-memory inside the agent pod. No Redis, no external dependencies, no infrastructure to deploy. It works with any LLM provider - Anthropic, OpenAI, Ollama, vLLM, whatever you run. The cache sits between the runner and the MCP dispatch, completely provider-agnostic.
Cache entries expire after `ttlSeconds`. Default is 300 (5 minutes) - long enough to cover a reasoning loop, short enough that stale data isn't a real risk.
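The store itself doesn't need to be fancy. Something like this map-plus-mutex sketch covers it (an assumed structure - the real implementation may differ):

```go
// Minimal in-memory TTL cache sketch. Entries expire ttl after insertion;
// expired entries are evicted lazily on the next lookup.
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

type TTLCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func New(ttl time.Duration) *TTLCache {
	return &TTLCache{ttl: ttl, m: make(map[string]entry)}
}

func (c *TTLCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok || time.Now().After(e.expiresAt) {
		delete(c.m, key) // lazy eviction of expired entries
		return "", false
	}
	return e.value, true
}

func (c *TTLCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}
```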
## What doesn't get cached
Three things bypass the cache unconditionally:
- Tools on the exclude list. `store_result`, `send_email`, `create_ticket` - anything that changes state.
- Failed calls. If the MCP server returns an error, the error is not cached. The next call retries fresh.
- Servers without `cache.enabled: true`. Default is off. You opt in per server, not globally.
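Put together, the dispatch path looks roughly like this (names and control flow are assumptions sketched from the behavior described above, not the actual source):

```go
// Sketch of the bypass logic: only successful results from cache-enabled,
// non-excluded tools are stored; everything else hits the real server.
package dispatch

import "slices"

type CacheConfig struct {
	Enabled      bool
	ExcludeTools []string
}

type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

func callWithCache(
	cfg CacheConfig,
	cache Cache,
	tool string,
	key string, // sha256(tool + canonical args), as described above
	callMCP func() (string, error),
) (string, error) {
	cacheable := cfg.Enabled && !slices.Contains(cfg.ExcludeTools, tool)
	if cacheable {
		if v, ok := cache.Get(key); ok {
			return v, nil // hit: the MCP server never sees the call
		}
	}
	result, err := callMCP()
	if err != nil {
		return "", err // errors are never cached; the next call retries fresh
	}
	if cacheable {
		cache.Set(key, result)
	}
	return result, nil
}
```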
## Progressive example
Start simple - cache everything on one server:
```yaml
tools:
  mcp:
    - name: knowledge-base
      url: http://kb-mcp.tools.svc:8080
      cache:
        enabled: true
```
Defaults kick in: 300s TTL, no excludes. Every tool on that server gets cached.
Get specific - different TTLs per server, exclude writes:
```yaml
tools:
  mcp:
    - name: search
      url: http://search.tools.svc:8080
      cache:
        enabled: true
        ttlSeconds: 600
    - name: crm
      url: http://crm-mcp.tools.svc:8080
      cache:
        enabled: true
        ttlSeconds: 60
        excludeTools:
          - update_contact
          - create_deal
          - send_message
```
Search results are stable for 10 minutes. CRM lookups get a shorter window because the data changes more often. Write operations always go through.
## When not to use it
If your tools return different results for the same input (random sampling, current timestamp, live sensor data), caching will serve stale values. Don't enable it for those servers.
If your reasoning loops genuinely need fresh data on every iteration (polling for completion, watching for state changes), the TTL needs to be shorter than your poll interval - or just leave caching off for that server.
## The takeaway
80% of your agent's tool calls are waste. Three lines of YAML eliminate them. The model gets identical results, your MCP servers get 80% less traffic, and your agent finishes faster.
Try it: add `cache.enabled: true` to one MCP server and watch your tool call metrics drop.