CL · May 20

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

Junhao Ruan, Abudukeyumu Abudula, Bei Li, Yongjing Yin, Xinyu Liu, Kechen Jiao, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, Jingbo Zhu

arXiv:2605.20729 · 93.9 · Has Code

AI Analysis

This work addresses the need for cost-effective, high-quality conversational retrieval benchmarks for RAG systems, offering a unified framework for auditing and synthesis.

MTR-Suite is a framework for evaluating and synthesizing conversational retrieval benchmarks. It comprises an LLM-based auditor (MTR-Eval), a multi-agent pipeline (MTR-Pipeline) that generates high-fidelity dialogues at 1/400th of the human annotation cost, and the resulting benchmark (MTR-Bench), which the authors report has superior discriminative power.

Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr-suite.
