Video Quality Assessment Based on Swin TransformerV2 and Coarse to Fine Strategy
This work addresses quality assessment for compressed videos, which matters for streaming and broadcasting applications. Its contribution is incremental, as it builds on existing transformer-based methods.
The paper tackles no-reference video quality assessment with a model that combines an enhanced spatial perception module and a lightweight temporal fusion module, and it reports state-of-the-art performance on benchmark datasets.
The objective of no-reference video quality assessment is to evaluate the quality of a distorted video without access to a high-definition reference. In this study, we introduce an enhanced spatial perception module, pre-trained on multiple image quality assessment datasets, and a lightweight temporal fusion module to address the no-reference video quality assessment (NR-VQA) task. The model uses Swin Transformer V2 as a local-level spatial feature extractor and fuses its multi-stage representations through a series of transformer layers. A temporal transformer then performs spatiotemporal feature fusion across the video. To accommodate compressed videos of varying bitrates, we incorporate a coarse-to-fine contrastive strategy that strengthens the model's ability to discriminate between features from videos of different bitrates. This is an expanded version of the one-page abstract.
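The abstract does not spell out the coarse-to-fine contrastive strategy, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that encodings of the same content at different bitrates are compared pairwise with a hinge (margin ranking) penalty: a coarse stage uses only pairs with a large bitrate gap and a large margin, and a fine stage admits closer bitrates with a smaller margin. The function names, the `stages` parameter, and all numeric values are hypothetical.

```python
from itertools import combinations


def bitrate_pairs(bitrates, min_gap):
    """Return (hi, lo) index pairs whose bitrates differ by at least min_gap.

    The index of the higher-bitrate video always comes first.
    """
    pairs = []
    for i, j in combinations(range(len(bitrates)), 2):
        diff = bitrates[i] - bitrates[j]
        if diff == 0 or abs(diff) < min_gap:
            continue  # equal or too-close bitrates give no ordering signal
        pairs.append((i, j) if diff > 0 else (j, i))
    return pairs


def margin_ranking_loss(scores, pairs, margin):
    """Mean hinge penalty over pairs.

    A pair is penalized when the higher-bitrate score does not exceed
    the lower-bitrate score by at least `margin`.
    """
    if not pairs:
        return 0.0
    total = sum(max(0.0, margin - (scores[hi] - scores[lo])) for hi, lo in pairs)
    return total / len(pairs)


def coarse_to_fine_losses(scores, bitrates, stages):
    """Evaluate the ranking loss for each (min_gap, margin) stage.

    `stages` is ordered from coarse (large gap, large margin) to fine
    (small gap, small margin). This schedule is an assumption made for
    illustration only.
    """
    return [
        margin_ranking_loss(scores, bitrate_pairs(bitrates, gap), margin)
        for gap, margin in stages
    ]


# Hypothetical predicted quality scores for one clip encoded at four bitrates (kbps).
bitrates = [500, 1000, 2000, 4000]
scores = [0.2, 0.5, 0.55, 0.9]
coarse, fine = coarse_to_fine_losses(scores, bitrates, [(1500, 0.3), (500, 0.1)])
print(coarse, fine)
```

In this toy input the coarse stage is already satisfied, while the fine stage penalizes the 1000 and 2000 kbps encodings, whose predicted scores are nearly tied. In an actual training setup the same idea would typically be applied to feature embeddings or scores inside a deep learning framework rather than to plain Python lists.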