Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice
This work addresses the inefficiency of fixed-rate image tokenization for downstream vision tasks, offering a more compressed and accurate representation.
TaTok introduces a theoretically grounded adaptive image tokenization framework that uses global tokens and dynamic filtering to reduce redundancy and information loss, achieving a 1.3x gFID improvement and 8.7x inference speedup over existing methods.
Accurate and effective discrete image tokenization is crucial for processing long image sequences. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup over existing methods. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.
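
To make the idea of filtering by cumulative conditional entropy concrete, the following is a minimal sketch and not TaTok's actual DTF implementation. It relies on the chain rule of entropy, H(x_1, ..., x_n) = sum_i H(x_i | x_1, ..., x_{i-1}), and assumes that a per-token conditional entropy estimate is already available. The function name `count_tokens_to_keep`, the `keep_fraction` threshold, and the stopping rule itself are hypothetical choices made for illustration; the abstract does not specify them.

```python
from itertools import accumulate
from typing import Sequence


def count_tokens_to_keep(cond_entropy: Sequence[float],
                         keep_fraction: float = 0.9) -> int:
    """Return how many leading tokens to keep for one image.

    cond_entropy[i] is an (assumed, externally estimated) value of
    H(x_i | x_1, ..., x_{i-1}) in any consistent unit. By the chain rule,
    the running sum of these values is the joint entropy of the prefix.

    Hypothetical rule: keep the shortest prefix whose cumulative
    conditional entropy reaches keep_fraction of the total.
    """
    if not cond_entropy:
        raise ValueError("cond_entropy must be non-empty")
    if any(h < 0 for h in cond_entropy):
        raise ValueError("conditional entropies must be non-negative")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    total = sum(cond_entropy)
    if total == 0:
        # No measurable information: keep a single token as a floor.
        return 1

    target = keep_fraction * total
    for k, cumulative in enumerate(accumulate(cond_entropy), start=1):
        if cumulative >= target:
            return k
    # Only reachable through floating-point rounding; keep everything.
    return len(cond_entropy)


# An information-rich image spreads entropy over many tokens, while a
# simple image concentrates it in the first few.
rich = [1.0, 1.0, 1.0, 1.0, 1.0]
simple = [2.0, 1.0, 0.5, 0.25, 0.25]
kept_rich = count_tokens_to_keep(rich, keep_fraction=0.85)
kept_simple = count_tokens_to_keep(simple, keep_fraction=0.85)
```

Under these assumptions, the information-rich example keeps all 5 tokens while the simple one keeps only 3, which mirrors the stated goal of allocating tokens according to information richness.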