An Ensemble Approach to Short-form Video Quality Assessment Using Multimodal LLM
This work addresses video quality assessment for short-form videos, which matters to platforms and users dealing with diverse content and artifacts. Its contribution is incremental, however, since it builds on existing MLLM and BVQA methods.
The paper tackled the challenge of assessing video quality for short-form videos by leveraging a pretrained multimodal large language model (MLLM) and combining it with existing blind video quality assessment (BVQA) models using an ensemble method, resulting in superior generalization performance.
The rise of short-form videos, characterized by diverse content, editing styles, and artifacts, poses substantial challenges for learning-based blind video quality assessment (BVQA) models. Multimodal large language models (MLLMs), renowned for their superior generalization capabilities, present a promising solution. This paper focuses on effectively leveraging a pretrained MLLM for short-form video quality assessment, examining the impacts of pre-processing and response variability and offering insights on combining the MLLM with BVQA models. We first investigated how frame pre-processing and sampling techniques influence the MLLM's performance. Then, we introduced a lightweight learning-based ensemble method that adaptively integrates predictions from the MLLM and state-of-the-art BVQA models. Our results demonstrated superior generalization performance with the proposed ensemble approach. Furthermore, the analysis of content-aware ensemble weights highlighted that some video characteristics are not fully represented by existing BVQA models, revealing potential directions to improve BVQA models further.
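To make the idea of content-aware ensemble weights concrete, the following is a minimal sketch and not the paper's implementation. It assumes, purely for illustration, that each video has a small content feature vector and one quality score per model (for example, one from the MLLM and one from a BVQA model), and that a softmax gate over a linear map of the features produces per-video weights. The names `ensemble_scores` and `fit_gate`, the linear gate, and the plain gradient-descent training on mean squared error are all hypothetical choices; the abstract does not specify the architecture or the training procedure.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_scores(features, predictions, W, b):
    """Fuse per-model scores with content-dependent weights.

    features:    (N, D) hypothetical content descriptors per video
    predictions: (N, K) quality scores from K models (e.g. MLLM, BVQA)
    W, b:        gate parameters with shapes (D, K) and (K,)
    Returns the fused scores (N,) and the per-video weights (N, K).
    """
    weights = softmax(features @ W + b)
    fused = (weights * predictions).sum(axis=1)
    return fused, weights


def fit_gate(features, predictions, targets, lr=0.1, steps=500):
    """Fit the gate by gradient descent on mean squared error (assumed loss)."""
    n, d = features.shape
    k = predictions.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(steps):
        fused, weights = ensemble_scores(features, predictions, W, b)
        err = fused - targets
        # d(fused)/d(logit_k) = w_k * (p_k - fused) for a softmax-weighted sum.
        grad_logits = (2.0 / n) * err[:, None] * weights * (
            predictions - fused[:, None]
        )
        W -= lr * features.T @ grad_logits
        b -= lr * grad_logits.sum(axis=0)
    return W, b


# Toy data: model 0 is accurate when the feature is +1, model 1 when it is -1.
features = np.array([[1.0], [1.0], [-1.0], [-1.0]])
predictions = np.array([[4.0, 2.0]] * 4)
targets = np.array([4.0, 4.0, 2.0, 2.0])

W, b = fit_gate(features, predictions, targets)
fused, weights = ensemble_scores(features, predictions, W, b)
```

In this toy setup the learned weights shift toward whichever model is accurate for a given feature value. Inspecting such weights per video is one way, under these assumptions, to see which content characteristics a given model handles poorly.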