AI · Apr 23

Efficient Agent Evaluation via Diversity-Guided User Simulation

IBM
arXiv:2604.21480 · 72.8 · h-index: 14
Predicted impact: top 46% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For developers of LLM-based agents, DIVERT offers a more efficient and thorough evaluation method to uncover rare failure modes in multi-turn interactions.

DIVERT introduces a diversity-guided user simulation framework that reuses conversation prefixes and branches at critical decision points, discovering more failures per token and expanding the set of tasks with identified failures compared to standard linear rollouts.

Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.
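The prefix-reuse idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Conversation` class, the word-count token cost, and the Jaccard-based `pick_diverse` selector are all hypothetical stand-ins, chosen only to show why resuming from a snapshot costs fewer tokens than regenerating the shared prefix for every rollout.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical stand-in for the full agent-environment state."""
    turns: list = field(default_factory=list)
    tokens_spent: int = 0

    def step(self, user_msg: str) -> None:
        # Assumption: a turn costs one token per user word plus one for the reply.
        self.turns.append((user_msg, f"ack:{user_msg}"))
        self.tokens_spent += len(user_msg.split()) + 1


def linear_rollouts(prefix, endings):
    """Baseline: every rollout regenerates the shared prefix from scratch."""
    finals, total = [], 0
    for ending in endings:
        conv = Conversation()
        for msg in prefix + [ending]:
            conv.step(msg)
        total += conv.tokens_spent
        finals.append(conv)
    return finals, total


def branched_rollouts(prefix, endings):
    """Snapshot once at the junction, then resume each branch from a copy."""
    base = Conversation()
    for msg in prefix:
        base.step(msg)
    total = base.tokens_spent  # the prefix is paid for only once
    finals = []
    for ending in endings:
        conv = copy.deepcopy(base)  # restore the snapshot
        before = conv.tokens_spent
        conv.step(ending)
        total += conv.tokens_spent - before
        finals.append(conv)
    return finals, total


def jaccard_distance(a: str, b: str) -> float:
    """1 minus word-set overlap; a crude proxy for semantic difference."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def pick_diverse(candidates, k):
    """Greedy farthest-point selection of k mutually dissimilar user replies."""
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        rest = [c for c in candidates if c not in chosen]
        if not rest:
            break
        # Pick the candidate whose nearest already-chosen reply is farthest away.
        chosen.append(
            max(rest, key=lambda c: min(jaccard_distance(c, s) for s in chosen))
        )
    return chosen
```

With the prefix `["hi there", "I need a refund"]` and three different endings, both functions produce identical final conversations, but the branched version pays for the two prefix turns once instead of three times. In a real system the snapshot would have to capture tool and environment state as well, and the diversity signal would presumably be semantic rather than lexical.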
