CV · LG · Sep 27, 2023

Finite Scalar Quantization: VQ-VAE Made Simple

arXiv:2309.15505v2 · 467 citations · h-index: 36
Originality: Incremental advance
AI Analysis

FSQ simplifies discrete representation learning for researchers and practitioners in generative modeling and computer vision, though the advance is incremental because it builds on the existing VQ-VAE framework.

The authors tackle the complexity of vector quantization in VQ-VAEs by proposing finite scalar quantization (FSQ), a simpler method that projects representations down to a few dimensions and quantizes each dimension to a small set of fixed values. FSQ achieves competitive performance in tasks such as image generation and depth estimation without suffering from codebook collapse.

We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically fewer than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and the number of values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
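The quantization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fsq_quantize`, `codebook_size`, `code_to_index`) are hypothetical, the example level counts `[7, 5, 5, 5]` are chosen only for illustration, and the sketch assumes an odd number of levels per dimension so that a plain `tanh` bound followed by rounding is enough. The gradient handling needed for training (for example a straight-through estimator around the rounding) is not shown.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each entry of z to one of levels[i] integers centred on 0.

    Assumption for this sketch: every level count is odd, so the grid for a
    dimension with L levels is {-(L // 2), ..., 0, ..., L // 2}.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = num_levels // 2
        bounded = half * math.tanh(value)  # squash into [-half, half]
        codes.append(round(bounded))       # snap to the nearest grid value
    return codes


def codebook_size(levels):
    """Size of the implicit codebook: the product of the per-dimension levels."""
    return math.prod(levels)


def code_to_index(codes, levels):
    """Map a quantized vector to a single integer token (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        digit = code + num_levels // 2  # shift from [-half, half] to [0, L - 1]
        index = index * num_levels + digit
    return index


levels = [7, 5, 5, 5]                     # hypothetical choice: 4 dimensions
codes = fsq_quantize([0.0, 10.0, -10.0, 0.0], levels)
token = code_to_index(codes, levels)
```

Because the grid is fixed, there are no codebook vectors to learn: the only design choice is the list of per-dimension level counts, and every one of the `codebook_size(levels)` combinations is a valid code that maps to a distinct token index.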

Code Implementations: 3 repos