CV · Oct 6, 2025

SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

arXiv:2510.04961v1 · 5 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work targets two weaknesses of current image tokenizers in generative models: reliance on adversarial training and slow iterative decoding. It offers a drop-in replacement for KL-VAE for building higher-quality and faster models, though the advance is incremental relative to existing diffusion decoder approaches.

The paper tackles the inefficiency and adversarial training requirements of diffusion decoders in image tokenization by introducing SSDD, a single-step diffusion decoder trained without adversarial losses. Compared with KL-VAE, SSDD improves reconstruction FID from 0.87 to 0.50 with 1.4x higher throughput, and it preserves the generation quality of DiTs with 3.8x faster sampling.

Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, as well as a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves the generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.
