CV | May 20, 2025

VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

arXiv:2505.14640v1 | 17 citations | h-index: 22 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need for robust and realistic benchmarks that accurately assess long video understanding in large multimodal models. The advance is incremental because it improves evaluation methodology rather than model capabilities.

The paper tackles inflated performance on existing long video understanding benchmarks, which arises because multiple-choice questions can be answered by guessing or from prior biases. It proposes VideoEval-Pro, a benchmark of open-ended short-answer questions that require understanding the full video. Results show that video LMMs lose more than 25% in performance on open-ended questions compared with MCQs, and that scores on VideoEval-Pro improve more with additional input frames than scores on MCQ benchmarks do, offering a more realistic evaluation.

Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sobering lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated by the possibility of guessing the correct answer. Second, a significant portion of the questions in these benchmarks have strong priors that allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy on Video-MME given a single random frame from a long video. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing open-ended short-answer questions, which truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we report the following findings: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared with MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
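The guessing effect described in the abstract can be illustrated with the standard correction-for-guessing arithmetic. The sketch below is not taken from the paper: it assumes a simplified model in which a system truly knows some fraction of the answers and guesses uniformly at random on the rest, and the function names and the choice of four options per question are hypothetical.

```python
def observed_mcq_accuracy(known_fraction: float, num_options: int) -> float:
    """Expected MCQ accuracy under a simplified guessing model (an assumption,
    not the paper's method): the model answers `known_fraction` of questions
    correctly and guesses uniformly among `num_options` choices on the rest."""
    if not 0.0 <= known_fraction <= 1.0:
        raise ValueError("known_fraction must lie in [0, 1]")
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    return known_fraction + (1.0 - known_fraction) / num_options


def chance_corrected_accuracy(observed: float, num_options: int) -> float:
    """Invert the model above: rescale observed accuracy so that pure
    guessing maps to 0 and perfect accuracy maps to 1."""
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    chance = 1.0 / num_options
    return (observed - chance) / (1.0 - chance)


# Hypothetical example: a model that truly knows 40% of the answers,
# evaluated on questions with four options each.
observed = observed_mcq_accuracy(0.4, 4)
corrected = chance_corrected_accuracy(observed, 4)
print(f"observed MCQ accuracy: {observed:.2f}")
print(f"chance-corrected accuracy: {corrected:.2f}")
```

Under this model a system with no knowledge at all still scores 1/num_options, which is one reason an open-ended format, where blind guessing rarely succeeds, can give a more conservative picture. The model does not capture the prior-bias effect the abstract also describes.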

Code Implementations: 2 repos