AI | CL | Feb 22

Benchmark Test-Time Scaling of General LLM Agents

arXiv:2602.18998v1 | 7 citations | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need for realistic evaluation of general LLM agents and reveals critical limitations in current test-time scaling approaches. It is an incremental advance because it builds on existing benchmark concepts.

The authors address the evaluation of general-purpose LLM agents by introducing General AgentBench, a unified benchmark spanning search, coding, reasoning, and tool-use domains. Ten leading agents showed substantial performance degradation in this setting, and test-time scaling methods failed to improve performance because of two limitations: a context ceiling in sequential scaling and a verification gap in parallel scaling.

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
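As a rough illustration of parallel scaling and the verification gap mentioned in the abstract, the sketch below may help. It is not taken from the General-AgentBench repository. It assumes each sampled trajectory is a dict with hypothetical `correct` and `score` fields, uses the standard unbiased pass@k estimator, and treats the verification gap as oracle best-of-k accuracy minus the accuracy of the trajectory an imperfect verifier picks. The paper's own definition may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k trajectories,
    drawn without replacement from n samples of which c are correct,
    solves the task."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def select_with_verifier(trajectories, score):
    """Parallel-scaling selection: keep the trajectory the verifier rates highest."""
    return max(trajectories, key=score)


def verification_gap(tasks, score):
    """Oracle best-of-k accuracy minus verifier-selected accuracy.

    `tasks` is a list of tasks; each task is a list of trajectory dicts
    with a boolean `correct` field (hypothetical schema for this sketch).
    """
    oracle = sum(any(t["correct"] for t in trajs) for trajs in tasks) / len(tasks)
    selected = sum(
        select_with_verifier(trajs, score)["correct"] for trajs in tasks
    ) / len(tasks)
    return oracle - selected


# Toy data: the verifier ranks the wrong trajectory first on the first task.
toy_tasks = [
    [{"correct": True, "score": 0.2}, {"correct": False, "score": 0.9}],
    [{"correct": True, "score": 0.8}, {"correct": False, "score": 0.1}],
]
gap = verification_gap(toy_tasks, score=lambda t: t["score"])
```

In this toy case a correct trajectory exists for every task, so the oracle accuracy is 1.0, yet the verifier selects a correct one only half the time. Sampling more trajectories raises the oracle number but helps in practice only if the selector can recognize the correct one.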

