CV · Jan 23

iFSQ: Improving FSQ for Image Generation with 1 Line of Code

arXiv:2601.17124v2 · 3 citations · h-index: 10 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a critical flaw in FSQ for image generation, enabling better benchmarking and insights into model performance, though it is incremental as it builds on existing FSQ with a simple modification.

The paper tackles activation collapse in Finite Scalar Quantization (FSQ) for image generation by replacing the activation function with a distribution-matching mapping. The resulting method, iFSQ, is claimed to mathematically guarantee optimal bin utilization and reconstruction precision. Using iFSQ as a benchmark, the authors find that the optimal equilibrium between discrete and continuous representations is about 4 bits per dimension, and that AR models converge faster initially while diffusion models reach a higher performance ceiling.

The field of image generation is currently bifurcated into autoregressive (AR) models operating on discrete tokens and diffusion models utilizing continuous latents. This divide, rooted in the distinction between VQ-VAEs and VAEs, hinders unified modeling and fair benchmarking. Finite Scalar Quantization (FSQ) offers a theoretical bridge, yet vanilla FSQ suffers from a critical flaw: its equal-interval quantization can cause activation collapse. This mismatch forces a trade-off between reconstruction fidelity and information efficiency. In this work, we resolve this dilemma by simply replacing the activation function in the original FSQ with a distribution-matching mapping to enforce a uniform prior. Termed iFSQ, this simple strategy requires just one line of code yet mathematically guarantees both optimal bin utilization and reconstruction precision. Leveraging iFSQ as a controlled benchmark, we uncover two key insights: (1) The optimal equilibrium between discrete and continuous representations lies at approximately 4 bits per dimension. (2) Under identical reconstruction constraints, AR models exhibit rapid initial convergence, whereas diffusion models achieve a superior performance ceiling, suggesting that strict sequential ordering may limit the upper bounds of generation quality. Finally, we extend our analysis by adapting Representation Alignment (REPA) to AR models, yielding LlamaGen-REPA. Code is available at https://github.com/Tencent-Hunyuan/iFSQ
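The idea of swapping the bounding activation for a distribution-matching mapping can be illustrated with a small sketch. It is not the authors' code: it assumes the pre-quantization activations are roughly standard normal, uses the Gaussian CDF (via `math.erf`) as one possible distribution-matching mapping, and bins the bounded value into equal-width intervals. The function names `fsq_bin`, `gaussian_cdf_squash`, and `bin_usage` are hypothetical, and the 16 levels correspond to 4 bits per dimension (log2 16 = 4).

```python
import math
import random


def fsq_bin(z, levels, squash):
    """Bound z to [-1, 1] with `squash`, then assign it to one of
    `levels` equal-width bins (equal-interval quantization)."""
    y = squash(z)
    index = int((y + 1.0) / 2.0 * levels)
    return min(index, levels - 1)  # y == 1.0 would otherwise overflow


def gaussian_cdf_squash(z):
    # Assumption: z ~ N(0, 1). Then 2 * Phi(z) - 1 = erf(z / sqrt(2))
    # is uniformly distributed on (-1, 1).
    return math.erf(z / math.sqrt(2.0))


def bin_usage(samples, levels, squash):
    """Fraction of samples that land in each bin."""
    counts = [0] * levels
    for z in samples:
        counts[fsq_bin(z, levels, squash)] += 1
    return [c / len(samples) for c in counts]


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

levels = 16  # 4 bits per dimension
tanh_usage = bin_usage(samples, levels, math.tanh)
matched_usage = bin_usage(samples, levels, gaussian_cdf_squash)

# Spread between the most and least used bin for each mapping.
tanh_spread = max(tanh_usage) - min(tanh_usage)
matched_spread = max(matched_usage) - min(matched_usage)
print(f"tanh spread:    {tanh_spread:.4f}")
print(f"matched spread: {matched_spread:.4f}")
```

Under the Gaussian assumption, the matched mapping should fill the bins almost evenly, while `tanh` leaves some bins noticeably more used than others. How closely real encoder activations follow that assumption, and which exact mapping the paper uses, should be checked against the linked repository.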

