AI · Apr 10

DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

Young-Suk Lee, Ramon Fernandez Astudillo, Radu Florian

arXiv:2604.09251 · 90.5

Predicted impact: top 19% in AI · last 90 days · Originality: Incremental advance

AI Analysis

The paper addresses a blind spot in assessing the real-world performance of AI agents that interleave browsing and computation, though the advance is incremental because it builds on existing benchmark concepts.

The authors address the evaluation of deep research agents that combine web browsing with multi-step computation. They introduce DRBENCHER, a synthetic benchmark generator that enforces verifiability, complexity, difficulty, and diversity across five domains. On the generated questions, the strongest frontier model reaches only 20% answer accuracy, and human evaluation judges 76% of the questions valid.

Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.
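The diversity criterion in the abstract, a greedy max-min embedding filter, can be illustrated with a minimal sketch. The abstract does not state the embedding model, the distance metric, or how the first question is chosen, so the cosine distance, the `seed_index` parameter, and the function names below are assumptions made for illustration, not the authors' implementation. The idea is to repeatedly add the candidate whose nearest already-selected neighbour is farthest away.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def greedy_max_min_select(embeddings, k, seed_index=0):
    """Pick up to k indices by greedy max-min (farthest-point) selection.

    Hypothetical helper: starts from seed_index, then at each step adds the
    item whose distance to its nearest selected item is largest.
    """
    if not embeddings or k <= 0:
        return []
    selected = [seed_index]
    # min_dist[i] tracks the distance from item i to its nearest selected item.
    min_dist = [cosine_distance(e, embeddings[seed_index]) for e in embeddings]
    target = min(k, len(embeddings))
    while len(selected) < target:
        chosen = set(selected)
        candidate = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: min_dist[i],
        )
        selected.append(candidate)
        # Refresh nearest-selected distances with the newly added item.
        for i, e in enumerate(embeddings):
            d = cosine_distance(e, embeddings[candidate])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy 2-D "embeddings": items 0 and 1 are near-duplicates.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = greedy_max_min_select(toy, k=3)
```

In this toy example the near-duplicate of the seed (index 1) is the item left out, which is the behaviour a coverage-maximizing filter is meant to have. A real pipeline would presumably use sentence embeddings of the generated questions in place of the hand-written vectors.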
