CVApr 16, 2025

DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment

arXiv:2504.11733v36 citationsh-index: 33IEEE transactions on circuits and systems for video technology (Print)
Originality Incremental advance
AI Analysis

This work addresses video quality assessment for applications like streaming and surveillance, but it is incremental as it builds on existing dual-stream and CLIP-based methods.

The paper tackled the problem of blind video quality assessment by addressing CLIP's inability to capture temporal and motion information in videos, proposing DVLTA-VQA which decouples and integrates CLIP components into a dual-stream framework, achieving state-of-the-art performance on multiple benchmarks with improvements of up to 0.15 in correlation metrics.

Inspired by the dual-stream theory of the human visual system (HVS) - where the ventral stream is responsible for object recognition and detail analysis, while the dorsal stream focuses on spatial relationships and motion perception - an increasing number of video quality assessment (VQA) works built upon this framework are proposed. Recent advancements in large multi-modal models, notably Contrastive Language-Image Pretraining (CLIP), have motivated researchers to incorporate CLIP into dual-stream-based VQA methods. This integration aims to harness the model's superior semantic understanding capabilities to replicate the object recognition and detail analysis in ventral stream, as well as spatial relationship analysis in dorsal stream. However, CLIP is originally designed for images and lacks the ability to capture temporal and motion information inherent in videos. To address the limitation, this paper propose a Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA), which decouples CLIP's visual and textual components, and integrates them into different stages of the NR-VQA pipeline. Specifically, a Video-Based Temporal CLIP module is proposed to explicitly model temporal dynamics and enhance motion perception, aligning with the dorsal stream. Additionally, a Temporal Context Module is developed to refine inter-frame dependencies, further improving motion modeling. On the ventral stream side, a Basic Visual Feature Extraction Module is employed to strengthen detail analysis. Finally, a text-guided adaptive fusion strategy is proposed to enable dynamic weighting of features, facilitating more effective integration of spatial and temporal information.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes