DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
This work solves the problem of robust video super-resolution for real-world applications with complex degradations, representing an incremental improvement by focusing on learning strategy over architectural complexity.
The paper tackles the challenge of applying diffusion models to video super-resolution (VSR) by addressing their failure on severely degraded videos, resulting in a method that excels in such scenarios with superior performance.
Diffusion models have demonstrated exceptional capabilities in image restoration, yet their application to video super-resolution (VSR) faces significant challenges in balancing fidelity with temporal consistency. Our evaluation reveals a critical gap: existing approaches consistently fail on severely degraded videos--precisely where diffusion models' generative capabilities are most needed. We identify that existing diffusion-based VSR methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead. Experiments demonstrate that our approach excels in scenarios where competing methods struggle, particularly on severely degraded videos. Our work reveals that addressing the learning strategy, rather than focusing solely on architectural complexity, is the critical path toward robust real-world video super-resolution with diffusion models.