CV · Jun 1, 2025

Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

arXiv:2506.01037v1 · 6 citations · h-index: 2 · CVPR
Originality: Incremental advance
AI Analysis

This paper addresses video quality enhancement for real-world applications and represents an incremental improvement over existing diffusion-based methods.

The paper tackles the artifacts that diffusion-based video super-resolution introduces because of its inherent randomness. It proposes a noise-robust framework that reports superior perceptual quality on real-world benchmark datasets.

Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
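The abstract does not specify the contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, and not the authors' implementation. It assumes that `anchors[i]` and `positives[i]` are feature vectors extracted from the same LR clip under two different degradations, and that features from other clips in the batch serve as negatives. The function names, the cosine similarity, and the `temperature` value are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    anchors[i] and positives[i] are assumed to come from the same clip under
    two different degradations; positives[j] with j != i act as negatives.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching pair (index i) as the target class.
        total += log_denominator - logits[i]
    return total / n


# Toy features: two clips, each seen under two degradations.
clip_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(clip_feats, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(clip_feats, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup `aligned` is close to zero and `mismatched` is large: the loss is small only when features of the same clip agree across degradations, which is the sense in which such an objective encourages degradation-insensitive features.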

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
