CV, Jul 27, 2025

Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training

arXiv:2507.20291v1 | 15 citations | h-index: 8 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key limitation of real-world image super-resolution in applications that depend on fine detail, such as text or texture recovery. It is an incremental advance because it builds on existing stable diffusion frameworks.

The paper tackles poor fine-structure preservation in real-world image super-resolution methods built on stable diffusion models. It proposes a Transfer VAE Training strategy that reduces the VAE downsampling factor from 8× to 4× and optimizes the VAE and UNet architectures. The authors report significantly better detail preservation with fewer FLOPs than state-of-the-art one-step diffusion models.
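To make the 8× versus 4× difference concrete, the sketch below computes the latent grid for each downsampling factor. It is only an arithmetic illustration: the 512-pixel input size and the function names are assumptions chosen for this example, not values or code from the paper.

```python
from typing import Tuple


def latent_grid(height: int, width: int, factor: int) -> Tuple[int, int]:
    """Spatial size of the latent when each image side is downsampled by `factor`."""
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return height // factor, width // factor


def latent_positions(height: int, width: int, factor: int) -> int:
    """Number of spatial positions the diffusion network has to process."""
    h, w = latent_grid(height, width, factor)
    return h * w


def pixels_per_position(factor: int) -> int:
    """Image pixels summarized by a single latent position."""
    return factor * factor


if __name__ == "__main__":
    size = 512  # hypothetical input resolution, used only for illustration
    for factor in (8, 4):
        print(
            f"{factor}x:",
            latent_grid(size, size, factor),
            latent_positions(size, size, factor),
            pixels_per_position(factor),
        )
```

Halving the factor means each latent position covers 16 pixels instead of 64, which leaves more room for small characters and textures, but it also quadruples the number of latent positions. That growth is the source of the extra compute that the architecture changes are meant to offset.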

Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. The official code can be found at https://github.com/Joyies/TVT.
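The two-stage order described in the abstract can be written down as a train/freeze schedule. The sketch below is a bookkeeping illustration only, not the authors' implementation: the module names (`encoder_8x_original`, `decoder_4x`, `encoder_4x`) are hypothetical labels, and treating the original encoder as frozen during the first stage is an assumption inferred from the phrase "based on the output features of the original VAE encoder".

```python
from dataclasses import dataclass
from typing import List, Tuple

# Modules assumed to come pre-trained (hypothetical label, not from the official code).
PRETRAINED = {"encoder_8x_original"}


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Tuple[str, ...]  # modules whose weights are updated in this stage
    frozen: Tuple[str, ...]     # modules used with fixed weights in this stage


def tvt_schedule() -> List[Stage]:
    """Stage order as stated in the abstract: decoder first, then encoder."""
    return [
        # Stage 1: fit the new 4x decoder on features from the original encoder.
        Stage("stage 1", trainable=("decoder_4x",), frozen=("encoder_8x_original",)),
        # Stage 2: fit the new 4x encoder against the now-fixed 4x decoder.
        Stage("stage 2", trainable=("encoder_4x",), frozen=("decoder_4x",)),
    ]


def check_schedule(stages: List[Stage]) -> bool:
    """True if every frozen module is pre-trained or was trained in an earlier stage."""
    trained_so_far = set()
    for stage in stages:
        if set(stage.trainable) & set(stage.frozen):
            return False  # a module cannot be trained and frozen at once
        for module in stage.frozen:
            if module not in PRETRAINED and module not in trained_so_far:
                return False  # frozen module has no learned weights yet
        trained_so_far.update(stage.trainable)
    return True


if __name__ == "__main__":
    for stage in tvt_schedule():
        print(stage.name, "train:", stage.trainable, "freeze:", stage.frozen)
    print("schedule valid:", check_schedule(tvt_schedule()))
```

Training the decoder first gives the second stage a fixed target, which matches the abstract's claim that the new encoder-decoder pair stays aligned with the original VAE latent space. Reversing the two stages fails the check, because the encoder would then be trained against a decoder that has no learned weights yet.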

