Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
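This work addresses a fundamental bottleneck in autoregressive image generation: the mismatch between how images are tokenized and how an autoregressive model consumes those tokens. It is relevant to computer vision researchers working on image generation, and it reports gains over existing methods in both generation quality and sampling speed.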
The paper tackles the misalignment between bidirectional image tokenization and unidirectional autoregressive models by introducing AliTok, a novel aligned tokenizer. With AliTok, a 177M-parameter model achieves a gFID of 1.44 and an IS of 319.5 on ImageNet-256; scaled to 662M parameters, it reaches a gFID of 1.28, surpassing the state-of-the-art diffusion method with 10x faster sampling.
Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which create a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Built on AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark. Scaled up to 662M parameters, our model reaches a gFID of 1.28, surpassing the state-of-the-art diffusion method while achieving a 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
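The contrast between bidirectional and unidirectional dependency can be illustrated with attention masks. The following is a minimal sketch in plain Python, not AliTok's implementation: the function names `bidirectional_mask`, `causal_mask`, and `visible_positions` are hypothetical, and the sketch assumes the usual convention that a causal mask lets position i attend only to positions 0 through i.

```python
def bidirectional_mask(n):
    """Mask for n tokens in which every token may attend to every token."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Mask for n tokens in which token i may attend only to tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_positions(mask, i):
    """Return the positions that token i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    n = 4
    # Under a bidirectional mask, token 1 depends on later tokens 2 and 3.
    print(visible_positions(bidirectional_mask(n), 1))
    # Under a causal mask, token 1 depends only on tokens 0 and 1,
    # which matches the next-token prediction order of an autoregressive model.
    print(visible_positions(causal_mask(n), 1))
```

Under the bidirectional mask, a token can encode information about tokens that come after it, which a next-token predictor never sees at generation time. The causal mask removes that dependence, which is the kind of forward dependency the abstract describes the causal decoder as enforcing on the encoder's output.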