Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation

Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu

arXiv:2605.06207

AI Analysis

This work addresses a fundamental inefficiency in autoregressive visual generation. For researchers working on image tokenization and generation, it offers a simple modification that substantially improves performance without additional training techniques.

The paper identifies the 'Entropy Cliff' phenomenon in discrete visual tokenizers: conditional entropy drops so rapidly along the sequence that predicting the remaining tokens becomes a memorization problem. The authors propose Variable Codebook Size Quantization (VCQ), which monotonically increases the codebook size along the sequence. VCQ reduces gFID without CFG from 27.98 to 14.80 on ImageNet 256×256 and reaches gFID 1.71 with 684M autoregressive parameters.

Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
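The Entropy Cliff expression above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code. It assumes $N$ is the number of training images (1,281,167 for the ImageNet-1k training split). The `geometric_schedule` function is a hypothetical schedule chosen purely for illustration, since the abstract does not specify how $K_t$ grows between $K_{\min}$ and $K_{\max}$. The variable-codebook cliff position is likewise an assumed generalization of the formula: the first position at which the cumulative $\log_2 K_t$ reaches $\log_2 N$.

```python
import math


def entropy_cliff_position(num_images: int, codebook_size: int) -> int:
    """Constant-codebook cliff position: t* = ceil(log2 N / log2 K)."""
    return math.ceil(math.log2(num_images) / math.log2(codebook_size))


def geometric_schedule(seq_len: int, k_min: int, k_max: int) -> list[int]:
    """Hypothetical schedule (an assumption, not the paper's):
    K_t is interpolated geometrically from k_min to k_max, rounded to an
    integer, and forced to be monotonically non-decreasing."""
    sizes: list[int] = []
    for t in range(seq_len):
        frac = t / (seq_len - 1) if seq_len > 1 else 0.0
        k = round(k_min * (k_max / k_min) ** frac)
        sizes.append(max(k, sizes[-1]) if sizes else k)
    return sizes


def variable_cliff_position(num_images: int, sizes: list[int]):
    """Assumed generalization of t*: the first position t (1-indexed) at
    which sum_{s<=t} log2 K_s >= log2 N. Returns None if never reached."""
    target = math.log2(num_images)
    bits = 0.0
    for t, k in enumerate(sizes, start=1):
        bits += math.log2(k)
        if bits >= target:
            return t
    return None


N_IMAGENET = 1_281_167  # ImageNet-1k training images (assumed meaning of N)

uniform_cliff = entropy_cliff_position(N_IMAGENET, 16384)
schedule = geometric_schedule(seq_len=256, k_min=2, k_max=16384)
variable_cliff = variable_cliff_position(N_IMAGENET, schedule)

print("uniform K=16384 cliff position:", uniform_cliff)
print("hypothetical variable-schedule cliff position:", variable_cliff)
```

For the uniform case, $\log_2 N \approx 20.3$ and $\log_2 K = 14$, so the ratio is about 1.45 and $t^{*} = 2$, matching the "2 out of 256 positions" figure in the abstract. Under any schedule that starts at $K_{\min}=2$, each early token carries only about one bit, so the cumulative capacity takes more positions to reach $\log_2 N$ and the cliff moves later in the sequence.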
