Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents
This work addresses the need for finer-grained evaluation of epistemic competence in information-seeking agents. The contribution is incremental: it builds on existing RL-trained LLM agents by adding step-level analysis.
The authors address the problem of evaluating how LLM search agents reason with external evidence in open-domain QA. They introduce SeekBench, a benchmark of 190 expert-annotated traces with over 1,800 response steps, to analyze grounding, recovery, and calibration.
Recent work has explored training Large Language Model (LLM) search agents with reinforcement learning (RL) for open-domain question answering (QA). However, most evaluations focus solely on final answer accuracy, overlooking how these agents reason with and act on external evidence. We introduce SeekBench, the first benchmark for evaluating the \textit{epistemic competence} of LLM search agents through step-level analysis of their response traces. SeekBench comprises 190 expert-annotated traces with over 1,800 response steps generated by LLM search agents, each enriched with evidence annotations for granular analysis of whether agents (1) generate reasoning steps grounded in observed evidence, (2) adaptively reformulate searches to recover from low-quality results, and (3) are properly calibrated to assess whether the current evidence is sufficient for providing an answer.
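To make the idea of step-level analysis concrete, the following is a minimal sketch of how an annotated trace might be represented and summarized. It is an illustration only and not SeekBench's actual schema or metrics: the class names, the step labels ("reasoning", "search", "answer"), the annotation fields, and both summary functions are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    # Hypothetical step labels: "reasoning", "search", or "answer".
    kind: str
    # Hypothetical annotation for reasoning steps: is the step supported
    # by evidence the agent has actually observed?
    grounded: Optional[bool] = None
    # Hypothetical annotation for answer steps: was the evidence gathered
    # so far sufficient to answer?
    evidence_sufficient: Optional[bool] = None


@dataclass
class Trace:
    question: str
    steps: List[Step]


def _fraction(flags: List[bool]) -> Optional[float]:
    # Share of True flags, or None when there is nothing to measure.
    return sum(flags) / len(flags) if flags else None


def grounded_fraction(traces: List[Trace]) -> Optional[float]:
    # Share of annotated reasoning steps marked as grounded.
    flags = [
        s.grounded
        for t in traces
        for s in t.steps
        if s.kind == "reasoning" and s.grounded is not None
    ]
    return _fraction(flags)


def answered_with_sufficient_evidence(traces: List[Trace]) -> Optional[float]:
    # Share of annotated answer steps given when evidence was sufficient.
    flags = [
        s.evidence_sufficient
        for t in traces
        for s in t.steps
        if s.kind == "answer" and s.evidence_sufficient is not None
    ]
    return _fraction(flags)
```

Under these assumptions, a trace with one grounded and one ungrounded reasoning step would yield a grounded fraction of 0.5. The point of the sketch is that such per-step summaries are invisible to an evaluation that scores only the final answer.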