CV · Mar 23

DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment

arXiv:2603.22125 · 88.61 citations · h-index: 8
Predicted impact: top 25% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses efficiency bottlenecks in high-resolution image generation for users of diffusion models, offering a plug-in solution that is incremental but practical.

The paper addresses token-count reduction for efficient training and inference in latent diffusion models. It proposes DA-VAE, which raises the compression ratio of a pretrained VAE with only lightweight adaptation of the diffusion backbone. This enables 1024x1024 image generation with 4x fewer tokens and a 6x speedup for 2048x2048 generation while preserving image quality.

Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffusion retraining. Pretrained diffusion models, by contrast, already exhibit a structured, lower-dimensional latent space; a simpler idea is therefore to expand the latent dimensionality while preserving this structure. We propose Detail-Aligned VAE (DA-VAE), which increases the compression ratio of a pretrained VAE with only lightweight adaptation of the pretrained diffusion backbone. DA-VAE uses an explicit latent layout: the first $C$ channels come directly from the pretrained VAE at a base resolution, while an additional $D$ channels encode higher-resolution details. A simple detail-alignment mechanism encourages the expanded latent space to retain the structure of the original one. With a warm-start fine-tuning strategy, our method enables $1024 \times 1024$ image generation with Stable Diffusion 3.5 using only $32 \times 32$ tokens, $4\times$ fewer than the original model, within 5 H100-days. It further unlocks $2048 \times 2048$ generation with SD3.5, achieving a $6\times$ speedup while preserving image quality. We also validate the method and its design choices quantitatively on ImageNet.
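The token arithmetic and the channel layout in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The function names are hypothetical, and the channel counts `C = 4` and `D = 12` are placeholders chosen only for illustration. The 64 x 64 baseline token grid is not stated in the abstract; it is inferred from the stated 4x reduction relative to 32 x 32 tokens. Channel maps are represented by string labels rather than tensors, so the example shows only the ordering of channels, not the encoders that produce them.

```python
def token_count(image_size: int, pixels_per_token: int) -> int:
    """Number of tokens for a square image when each token covers a
    pixels_per_token x pixels_per_token region of the image."""
    if image_size % pixels_per_token != 0:
        raise ValueError("image_size must be divisible by pixels_per_token")
    side = image_size // pixels_per_token
    return side * side


def pack_latent(base_channels: list, detail_channels: list) -> list:
    """Channel-wise layout: the C base channels first, then the D detail channels."""
    return list(base_channels) + list(detail_channels)


def split_latent(latent: list, num_base: int) -> tuple:
    """Recover (base, detail) from the packed layout."""
    return latent[:num_base], latent[num_base:]


# Token counts at 1024 x 1024. The 32 x 32 grid is from the abstract; the
# 64 x 64 baseline grid is inferred from the stated 4x reduction.
baseline_tokens = token_count(1024, 16)    # 64 * 64 tokens
compressed_tokens = token_count(1024, 32)  # 32 * 32 tokens
reduction = baseline_tokens // compressed_tokens

# Hypothetical channel counts, chosen only for illustration.
C, D = 4, 12
base = [f"base_{i}" for i in range(C)]
detail = [f"detail_{i}" for i in range(D)]
latent = pack_latent(base, detail)
recovered_base, recovered_detail = split_latent(latent, C)
```

Because the base channels occupy a fixed leading slice, slicing the first `C` entries recovers them unchanged. This is the property the abstract relies on when it says the first $C$ channels come directly from the pretrained VAE.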
