CL · AI · May 27, 2025

MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

arXiv:2505.23810v29.63 citations · h-index: 21 · Has Code · EMNLP

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of evaluating LLM robustness in realistic dialogue scenarios, a problem of interest to AI researchers. It is incremental, however, in that it contributes a benchmark rather than a novel method.

The authors tackle the lack of benchmarks for evaluating LLM robustness in long, complex multi-turn dialogues by introducing MARS-Bench, a benchmark derived from play-by-play commentary. Their experiments show that closed-source LLMs outperform open-source ones and that explicit reasoning improves performance.

Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, their robustness in long, complex dialogue sessions, which involve frequent motivation transfer and sophisticated cross-turn dependency, has long been criticized. Nevertheless, no existing benchmark fully reflects these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to close this gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues, and is specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench reveal that closed-source LLMs significantly outperform open-source alternatives, that explicit reasoning significantly boosts LLMs' robustness in long, complex dialogue sessions, and that LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, based on an attention visualization experiment on Qwen2.5-7B-Instruction, we provide a mechanistic interpretation of how attention sinks caused by special tokens degrade LLMs' performance in long, complex dialogue sessions.

