CV · LG · May 14

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

arXiv:2605.15196 · 79.2
Predicted impact: top 30% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the architectural asymmetry in latent diffusion models where decoders lack conditioning, leading to detail loss, and provides a plug-and-play solution that improves multiple video generation tasks without fine-tuning.

RefDecoder enhances video generation by conditioning the VAE decoder on a reference image, achieving up to +2.1 dB PSNR improvement over unconditional baselines on reconstruction benchmarks and improving subject consistency, background consistency, and overall quality on the VBench I2V benchmark.
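To put the reported PSNR gain in perspective, here is a minimal sketch using the standard PSNR definition, 10 · log10(MAX² / MSE). The helper names `psnr` and `mse_ratio_for_gain` are illustrative, and the paper's exact evaluation protocol (value range, per-frame averaging) is not stated here, so a pixel range of [0, 1] is assumed.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error.

    max_val is the peak pixel value (assumed 1.0 for images scaled to [0, 1]).
    """
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    ratio = mse_ratio_for_gain(2.1)
    print(f"A +2.1 dB gain corresponds to MSE lower by a factor of {ratio:.2f}")
```

Under this definition, a +2.1 dB improvement means the reconstruction MSE is roughly 1.6 times smaller than the baseline's, independent of the assumed pixel range.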

Video generation powers a vast array of downstream applications. However, while latent diffusion models, the de facto standard, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
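The abstract does not spell out how reference attention is implemented, so the following is only a rough sketch of one plausible reading: at a decoder stage, video latent tokens attend over the concatenation of themselves and the reference-image tokens. The function name `reference_attention`, the single-head formulation, the residual connection, and all shapes are assumptions made for illustration, not the authors' architecture.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v):
    """Hypothetical single-head reference attention.

    video_tokens: (N, d) tokens from the denoised video latent at one stage.
    ref_tokens:   (M, d) tokens from a (hypothetical) reference image encoder.
    w_q, w_k, w_v: (d, d) projection matrices.
    Returns the updated video tokens (N, d) and the attention map (N, N + M).
    """
    # Assumption: "co-processed" means video queries see both token sets.
    context = np.concatenate([video_tokens, ref_tokens], axis=0)  # (N + M, d)
    q = video_tokens @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each row sums to 1
    # Assumption: residual connection keeps the original video features.
    return video_tokens + attn @ v, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    video = rng.standard_normal((16, d))  # 16 video latent tokens
    ref = rng.standard_normal((4, d))     # 4 reference image tokens
    w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
    out, attn = reference_attention(video, ref, w_q, w_k, w_v)
    print(out.shape, attn.shape)
```

In a real decoder this block would presumably be repeated at each up-sampling stage with tokens at the matching resolution; the sketch shows a single stage only.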

