AI · IR · MA · Apr 24

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

arXiv:2604.22436 · 91.21 citations · h-index: 3 · Has Code
AI Analysis

For researchers and developers building AI agent ecosystems, this benchmark exposes the limitations of description-based retrieval and provides a foundation for improving agent discovery.

AgentSearchBench introduces a benchmark with nearly 10,000 real-world agents to evaluate agent search under realistic conditions, revealing that semantic similarity alone is insufficient and that lightweight execution-aware signals significantly improve ranking quality.

The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
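As a rough illustration of how a lightweight execution signal could change a ranking, the sketch below blends a description-similarity score with the success rate of a few cheap probe runs. It is not the benchmark's method: the `Candidate` fields, the linear blend, the `alpha` weight, and all scores are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical record for one agent in the candidate pool."""
    name: str
    semantic_score: float      # query/description similarity, assumed in [0, 1]
    probe_results: List[bool]  # outcomes of a few cheap probe executions


def probe_success_rate(c: Candidate) -> float:
    """Fraction of probe runs that succeeded; 0.0 when no probes were run."""
    if not c.probe_results:
        return 0.0
    return sum(c.probe_results) / len(c.probe_results)


def rerank(candidates: List[Candidate], alpha: float = 0.5) -> List[Candidate]:
    """Sort by a linear blend of semantic and execution signals.

    alpha = 1.0 reduces to description-only ranking (an assumed baseline).
    """
    def score(c: Candidate) -> float:
        return alpha * c.semantic_score + (1 - alpha) * probe_success_rate(c)

    return sorted(candidates, key=score, reverse=True)


# Made-up data: agent_a has the best description match but fails most probes.
candidates = [
    Candidate("agent_a", 0.9, [False, False, True]),
    Candidate("agent_b", 0.7, [True, True, True]),
    Candidate("agent_c", 0.5, [True, False, True]),
]

semantic_only = [c.name for c in rerank(candidates, alpha=1.0)]
blended = [c.name for c in rerank(candidates, alpha=0.5)]
print(semantic_only)
print(blended)
```

With these toy numbers, the description-only ranking puts agent_a first, while the blended ranking promotes agent_b, whose probes all succeed. This mirrors, in miniature, the gap between semantic similarity and actual performance that the abstract describes.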
