CL · Jan 30

DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning

arXiv:2602.00352v1 · h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses researchers' need for more realistic benchmarks in conversational AI, though the advance is incremental because it extends existing single-turn evaluations to a multi-turn setting.

The authors tackled the problem of evaluating agent capabilities in multi-turn tip-of-the-tongue search processes by introducing DETOUR, a dual-agent benchmark with 1,011 prompts, and found that current state-of-the-art models achieve only 36% accuracy across all modalities.

When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.
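The dual-agent setup described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names (Prompt, MemoryAgent, run_episode, accuracy), the turn limit, the exact-match scoring, and the text-only clues are all assumptions made for illustration, and the real benchmark also covers image, audio, and video.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Prompt:
    target: str        # entity being recalled (hypothetical field)
    clues: List[str]   # vague facts the Memory Agent may reveal (hypothetical field)

class MemoryAgent:
    """Held fixed across evaluations; reveals one underspecified clue per query."""
    def __init__(self, prompt: Prompt) -> None:
        self._clues = list(prompt.clues)

    def answer(self, query: str) -> str:
        # A real Memory Agent would condition on the query; this toy one ignores it.
        return self._clues.pop(0) if self._clues else "I don't remember anything else."

# A Primary Agent maps the dialogue history to ("ask", query) or ("guess", entity).
History = List[Tuple[str, str]]
PrimaryAgent = Callable[[History], Tuple[str, str]]

def run_episode(primary: PrimaryAgent, prompt: Prompt, max_turns: int = 5) -> bool:
    """Run one multi-turn episode; True if the Primary Agent names the target."""
    memory = MemoryAgent(prompt)
    history: History = []
    for _ in range(max_turns):
        action, text = primary(history)
        if action == "guess":
            # Assumed scoring rule: case-insensitive exact match.
            return text.strip().lower() == prompt.target.lower()
        history.append((text, memory.answer(text)))
    return False  # ran out of turns without a guess

def accuracy(primary: PrimaryAgent, prompts: List[Prompt], max_turns: int = 5) -> float:
    """Fraction of prompts the Primary Agent resolves correctly."""
    return sum(run_episode(primary, p, max_turns) for p in prompts) / len(prompts)

def toy_primary(history: History) -> Tuple[str, str]:
    """Stand-in for the model under evaluation: keyword lookup over replies."""
    known = {"ring": "The Lord of the Rings", "shark": "Jaws"}
    for _, reply in history:
        for keyword, entity in known.items():
            if keyword in reply.lower():
                return ("guess", entity)
    return ("ask", "What else do you remember?")

# Made-up prompts for the sketch; they are not taken from the benchmark.
prompts = [
    Prompt("The Lord of the Rings",
           ["It was a long fantasy film.", "There was a ring that had to be destroyed."]),
    Prompt("Jaws", ["It took place near a beach.", "A shark attacked swimmers."]),
    Prompt("Casablanca", ["It was black and white."]),
]
```

Calling accuracy(toy_primary, prompts) scores the toy agent over the three made-up prompts. Keeping MemoryAgent identical across runs mirrors the paper's point that only the Primary Agent varies between evaluations, so differences in accuracy can be attributed to the agent under test.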

