CV · AI · May 11

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

Xiusheng Huang, Xin Jiang, Jun Zhao, Kang Liu, Yequan Wang

arXiv:2605.16384

AI Analysis

This work addresses the inefficiency of fixed-rate image tokenization for downstream vision tasks, offering a more compressed and accurate representation.

TaTok introduces a theoretically grounded adaptive image tokenization framework that uses global tokens and dynamic filtering to reduce redundancy and information loss, achieving a 1.3x gFID improvement and 8.7x inference speedup over existing methods.

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.
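The abstract does not spell out the DTF algorithm, so the following is only a minimal sketch of the general idea of allocating tokens by cumulative conditional entropy, not the authors' method. It assumes that a per-token estimate of H(token_i | earlier tokens) is already available; the function name `adaptive_prefix_length`, the `budget` parameter, and the example entropy values are all hypothetical. By the chain rule, the running sum of these conditional entropies equals the joint entropy of the token prefix, so the sketch keeps the shortest prefix that covers a chosen fraction of the total.

```python
from itertools import accumulate


def adaptive_prefix_length(cond_entropies, budget=0.95):
    """Return how many leading tokens to keep (illustrative only).

    cond_entropies[i] is an ASSUMED estimate, in bits, of
    H(token_i | token_0, ..., token_{i-1}). How TaTok obtains or uses
    such estimates is not stated in the abstract.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    total = sum(cond_entropies)
    if total <= 0.0:
        return 0  # nothing informative to keep
    target = budget * total
    # Chain rule: the running sum is the joint entropy of the prefix.
    for count, running in enumerate(accumulate(cond_entropies), start=1):
        if running >= target:
            return count
    return len(cond_entropies)


# Hypothetical entropy profiles: a low-detail image and a high-detail one.
flat_image = [2.0, 0.1, 0.05, 0.05, 0.0, 0.0]
busy_image = [2.0, 1.8, 1.9, 1.7, 1.6, 1.5]

print(adaptive_prefix_length(flat_image, budget=0.8))  # 1
print(adaptive_prefix_length(busy_image, budget=0.8))  # 5
```

Under these assumed numbers, the low-detail image needs one token while the high-detail image needs five, which illustrates the variable-rate behavior the abstract describes in contrast to fixed-rate compression.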
