MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment
This addresses the problem of selecting high-quality videos for pre-training video generation models, providing a more interpretable and comprehensive dataset for researchers and practitioners in video assessment.
The paper tackles the lack of comprehensiveness and interpretability in video quality assessment (VQA) by introducing MVQA-68K, a dataset with over 68,000 videos annotated across seven quality dimensions and chain-of-thought reasoning, which significantly enhances multimodal large language models (MLLMs) to achieve state-of-the-art results on internal and public benchmarks.
With the rapid advancement of video generation models such as Sora, video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. Traditional VQA methods, typically producing single numerical scores, often lack comprehensiveness and interpretability. To address these challenges, we introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos, covering seven essential quality dimensions: overall aesthetics, camera movement, dynamic degree, texture detail, composition, visual quality, and factual consistency. Each annotation includes detailed chain-of-thought reasoning to facilitate interpretability and comprehensive understanding. Extensive experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task, achieving state-of-the-art results not only on our internal test set (Fig.1) but also on public benchmarks including LSVQ-test, LSVQ-1080p, and LIVE-VQC. Meantime, incorporating explicit reasoning process during VQA training substantially boosts the zero-shot generalization. Code and dataset will be available at github: https://github.com/Controller01-ai/MVQA-68K