CV · AI · LG · Feb 5, 2025

Masked Autoencoders Are Effective Tokenizers for Diffusion Models

arXiv:2502.03444v2 · 77 citations · h-index: 10 · ICML
Originality: Highly original
AI Analysis

This work targets efficient, high-quality image generation for computer vision applications. Its approach is presented as novel rather than incremental, with substantial practical gains in training speed and inference throughput.

The paper studies how the structure of the latent space affects latent diffusion models for image synthesis. It finds that latent distributions with fewer Gaussian Mixture modes and more discriminative features yield better generation quality. Building on this, it proposes MAETok, an autoencoder trained with mask modeling, which reaches a gFID of 1.69 on ImageNet using 128 tokens, with 76x faster training and 31x higher inference throughput for 512x512 generation.

Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that make diffusion models easier to train and better at generation remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and that a discriminative latent space from an AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models are released.
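
To make the "mask modeling" idea concrete, here is a minimal sketch of the input-masking step used in masked autoencoding in general. It is not the MAETok implementation: the function name `mask_patches`, the list-of-lists patch representation, the shared `mask_token`, and the 0.5 mask ratio are all assumptions chosen for illustration.

```python
import random


def mask_patches(patches, mask_ratio, mask_token, seed=None):
    """Replace a random subset of patch embeddings with a shared mask token.

    Illustrative only: names and defaults are hypothetical, not from MAETok.
    Returns the masked sequence and the sorted indices that were masked.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * len(patches)))
    # Sample masked positions without replacement.
    masked_idx = set(rng.sample(range(len(patches)), num_masked))
    masked = [mask_token if i in masked_idx else p
              for i, p in enumerate(patches)]
    return masked, sorted(masked_idx)


# Toy input: 8 patch embeddings of dimension 4 (values are arbitrary).
patches = [[float(i)] * 4 for i in range(1, 9)]
mask_token = [0.0] * 4
masked, idx = mask_patches(patches, mask_ratio=0.5,
                           mask_token=mask_token, seed=0)
```

In a typical masked-autoencoding setup, the encoder sees the masked sequence and the decoder is trained to reconstruct targets at the masked positions; which targets MAETok reconstructs is not stated in this abstract.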

