FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
This work targets fine-grained video motion comprehension in multimodal large language models (MLLMs), giving model developers tools to benchmark and improve their models. The contribution is incremental in that it builds on existing benchmarking efforts.
The authors introduce FAVOR-Bench, a benchmark of 1,776 videos and 8,184 multiple-choice questions, and FAVOR-Train, a dataset of 17,152 videos. Experiments with 21 state-of-the-art MLLMs reveal significant limitations in motion understanding, and finetuning Qwen2.5-VL on FAVOR-Train improves performance on motion-related tasks.
Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop two caption assessment methods: a novel, cost-efficient LLM-free method and a GPT-assisted method, where the former enhances benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. Finetuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on the motion-related tasks of TVBench, MotionBench, and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/
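As an illustration of the close-ended evaluation described above, the following minimal sketch shows one way multiple-choice accuracy could be broken down by sub-task. It is not the benchmark's official scoring code: the record keys (`task`, `answer`, `prediction`), the function name `per_task_accuracy`, and the placeholder sub-task labels are all assumptions made for this example, and the real FAVOR-Bench data format may differ.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute accuracy per sub-task and overall.

    `records` is an iterable of dicts with hypothetical keys:
      'task'       - sub-task label (placeholder names, not the real ones)
      'answer'     - ground-truth option letter, e.g. "A"
      'prediction' - option letter returned by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1

    per_task = {task: correct[task] / total[task] for task in total}
    n_total = sum(total.values())
    overall = sum(correct.values()) / n_total if n_total else 0.0
    return per_task, overall


# Toy records for illustration only; these are not benchmark data.
records = [
    {"task": "task_a", "answer": "A", "prediction": "a"},
    {"task": "task_a", "answer": "B", "prediction": "C"},
    {"task": "task_b", "answer": "D", "prediction": "D "},
]
per_task, overall = per_task_accuracy(records)
```

Reporting per-sub-task accuracy alongside the overall figure makes it easier to see which kinds of motion questions a model fails on, since a single aggregate number can hide weak sub-tasks.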