June 1, 2026

Academic Research Just Confirmed It: Most Agent Tool Descriptions Are “Smelly”, And That’s Why Your Agents Struggle in Production

A new paper from Queen’s University researchers didn’t just theorize about flaky AI agents. They measured it.

They analyzed 856 tools across 103 real Model Context Protocol (MCP) servers, both official ones from big names and community-built ones. Using a structured rubric and a scanner built on foundation models (FMs), they found that 97.1% of tool descriptions contain at least one “smell.”

More than half (56%) fail to clearly state the tool’s purpose. Nearly 90% miss usage guidelines or limitations. Most leave parameters opaque.

These aren’t minor documentation nits. They are the exact reasons agents pick the wrong tool, pass bad arguments, lose context over long chains, or silently degrade until something breaks.

The researchers called these issues “smells.” We see them every day as the root causes of production pain.

[Figure: AI agent tool descriptions transformed from unclear, incomplete metadata into structured tool cards with purpose, parameters, limitations, and examples.]

What the Paper Actually Found (and Why the Stats Matter)

The team built a clear scoring rubric around six components that good tool descriptions should have: Purpose, Guidelines, Limitations, Parameter Explanation, Length & Completeness, and Examples.
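To make the six components concrete, here is a minimal sketch of a "tool card" with one field per content component and a crude completeness check. It is an illustration only: the paper's scanner is FM-based, while this version swaps in a simple empty-field test. The names `ToolCard` and `check_card` and the 20-word threshold are assumptions made for this example, not values from the paper.

```python
from dataclasses import dataclass, fields


@dataclass
class ToolCard:
    """One tool description split into the rubric's content components."""
    purpose: str = ""
    guidelines: str = ""
    limitations: str = ""
    parameter_explanation: str = ""
    examples: str = ""


def check_card(card: ToolCard, min_words: int = 20) -> list[str]:
    """Return the rubric components this card appears to be missing.

    Empty fields are reported by name. "Length & Completeness" is
    approximated by a total word count, which is an assumed stand-in
    for the judgment an FM-based scanner would make.
    """
    smells = [f.name for f in fields(card) if not getattr(card, f.name).strip()]
    total_words = sum(len(getattr(card, f.name).split()) for f in fields(card))
    if total_words < min_words:
        smells.append("length_and_completeness")
    return smells


# A typical thin description: a purpose line and nothing else.
thin = ToolCard(purpose="Search files.")
print(check_card(thin))
```

For the thin card, every component except `purpose` is reported, along with the length check. A real scanner would also need to judge whether a stated purpose is actually clear, which a non-empty test cannot do.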

They then ran a multi-model “LLM-as-Jury” scanner across hundreds of real descriptions. The results were stark:

  • Only 2.9% of descriptions were clean across the key components.
  • Problems appeared equally in official and community servers — this is systemic, not a “some indie devs are sloppy” issue.
  • When they augmented the descriptions (adding the missing clarity), agents on the MCP-Universe benchmark saw a median 5.85 percentage point lift in task success rate and a 15.12% improvement in partial goal completion.
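The jury idea behind the scanner can be sketched as a vote over independent judgments. This post does not spell out the paper's aggregation rule, so the strict-majority rule below is an assumption, and the hard-coded votes stand in for real model outputs. `jury_verdict` and `flagged_smells` are hypothetical names.

```python
def jury_verdict(votes: list[bool]) -> bool:
    """Return True when a strict majority of jurors flagged the smell."""
    return sum(votes) * 2 > len(votes)


def flagged_smells(votes_by_component: dict[str, list[bool]]) -> list[str]:
    """List the components the jury flagged, in input order."""
    return [
        component
        for component, votes in votes_by_component.items()
        if jury_verdict(votes)
    ]


# Each list holds one vote per juror model (True means "smell present").
votes = {
    "purpose": [True, True, False],
    "limitations": [True, True, True],
    "examples": [False, False, True],
}
print(flagged_smells(votes))
```

Using several jurors instead of one reduces the chance that a single model's quirk decides the verdict. A tie counts as "not flagged" here, which is a design choice for this sketch rather than anything taken from the paper.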

But here’s the honest part the paper also surfaces: richer descriptions increased execution steps by a median of 67.46%. In 16.67% of cases, performance actually regressed. More information helps — until it bloats context or introduces new ambiguity.

This isn’t hand-wavy. It’s statistical evidence that insufficient tool metadata is a widespread, measurable drag on agent reliability.

The “Smells” Map Directly to the Failures You Feel

If you’ve shipped agentic systems, these will sound painfully familiar:

  • Unclear Purpose + Missing Usage Guidelines → Tool selection errors. The agent doesn’t know when (or when not) to reach for a tool, so it guesses or picks the wrong one.
  • Unstated Limitations + Opaque Parameters → Context degradation and bad arguments. The agent and the tools themselves operate with an incomplete picture, leading to drift over multi-step workflows.
  • Underspecified or Incomplete descriptions → Cascading and silent failures. Small misunderstandings compound. Sometimes the agent keeps running while producing quietly wrong results until a user or downstream system notices.

The paper shows these aren’t rare edge cases. They are the default state of most MCP tool descriptions today.

That academic rigor is useful because it removes the “maybe it’s just our setup” doubt. The problem is real, it’s everywhere in the wild, and it directly undermines production reliability.

This Is Exactly Why We Built Trustabl Agent Analyzer

At Trustabl, we’ve been watching the same pattern: developers wire up promising agents, everything looks good in demos, then real workloads expose the gaps. The descriptions that came with the tools simply weren’t built for the demands of production agentic systems.

Agent Analyzer was designed to close that gap automatically.

Instead of leaving agents to work with thin, ambiguous metadata, Agent Analyzer supplies richer context that reaches both the agent and the tools themselves. It adds explicit guidance on what a tool is for and, just as importantly, what it is not for. It layers in observability and supports early pre-testing (including in environments like OpenShell) so issues surface while they’re still cheap to fix.

The result? Agents that move from “mostly works in testing” to production-ready with far less manual debugging.

We didn’t invent the need for better descriptions. The Queen’s University team just showed how widespread and costly the current state is. Agent Analyzer turns that insight into something you can apply to your own tools and workflows, almost entirely automatically, and without ripping out your existing development process.

The Practical Takeaway

The paper also shows something important for builders: simply making descriptions longer isn’t always the win. Smart, targeted enrichment often delivers most of the benefit with less overhead. That aligns with how Agent Analyzer works: focused richness rather than blanket verbosity.
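One way to picture targeted enrichment is to fill in only the components that are missing, and to stop when a size budget is reached. The sketch below is a hypothetical illustration of that idea. It is neither the paper's augmentation method nor Agent Analyzer's implementation, and the `enrich` function and its word budget are assumptions.

```python
def enrich(
    description: dict[str, str],
    additions: dict[str, str],
    max_words: int,
) -> dict[str, str]:
    """Fill only missing components while staying within a word budget.

    Components that already have text are left untouched. Candidate
    additions are tried in the order given, and any addition that would
    push the total past max_words is skipped.
    """
    result = dict(description)
    used = sum(len(text.split()) for text in result.values())
    for name, text in additions.items():
        if result.get(name, "").strip():
            continue  # already documented; do not overwrite
        cost = len(text.split())
        if used + cost > max_words:
            continue  # would exceed the budget; skip this addition
        result[name] = text
        used += cost
    return result


current = {"purpose": "Search files by name.", "limitations": ""}
candidates = {
    "purpose": "Find files.",
    "limitations": "Returns at most 100 results.",
    "examples": "search_files(pattern='*.md') lists Markdown files in the workspace root.",
}
print(enrich(current, candidates, max_words=10))
```

With a budget of 10 words, the existing purpose is kept, the short limitations note is added, and the longer example is skipped. Listing candidate additions in priority order lets the most valuable components claim the budget first.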

If you’re trying to ship reliable agentic systems (or upskill into roles that actually require shipping them), the bottleneck isn’t usually the model. It’s the quality of the instructions and context those models receive about the tools they can use.

The research confirms the pain is real and systemic. Agent Analyzer exists to make fixing it practical and automatic.

Head over to trustabl.ai and see what Agent Analyzer can do with the tools you’re already using. Turn descriptions that currently create friction into ones that actually support smooth, reliable production behavior.

The data says the problem is everywhere. Now you have a straightforward way to address it.

