CL Feb 26

MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa, Marina Danilevsky

IBM

arXiv:2602.23184v12.68 citations h-index: 26 Has Code

Originality Synthesis-oriented

AI Analysis

This benchmark gives researchers and developers a way to evaluate multi-turn RAG systems and to pinpoint where current retrieval and generation models are weak.

This paper introduces MTRAG-UN, a benchmark for evaluating multi-turn retrieval-augmented generation (RAG) systems, comprising 666 tasks and over 2,800 conversation turns across 6 domains. Experiments with the benchmark show that current retrieval and generation models struggle with conversations that involve unanswerable, underspecified, or non-standalone questions, or unclear responses.

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark
