CV | AI | Oct 20, 2025

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping

arXiv:2510.17722v1 | 14.45 citations | h-index: 9 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the limited real-world applicability of existing evaluations by giving researchers and developers a benchmark for multi-turn video dialogues, though it is incremental in that it extends existing evaluation frameworks.

The authors tackled the lack of evaluation benchmarks for multimodal large language models in multi-turn dialogues by introducing MT-Video-Bench, a holistic video understanding benchmark with 987 curated dialogues, revealing significant performance discrepancies among state-of-the-art models.

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.
