CV · AI · May 11

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

arXiv:2605.16384 · 51.3
AI Analysis

This work addresses the inefficiency of fixed-rate image tokenization for downstream vision tasks, offering a more compressed and accurate representation.

TaTok introduces a theoretically grounded adaptive image tokenization framework that uses global tokens and dynamic filtering to reduce redundancy and information loss, achieving a 1.3x gFID improvement and 8.7x inference speedup over existing methods.

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.
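The abstract does not spell out how Dynamic Token Filtering works, so the following is only a minimal sketch of one plausible reading of "cumulative conditional entropy": keep the shortest prefix of patch tokens whose summed conditional entropy reaches a chosen fraction of the total. The function name `count_tokens_to_keep`, the `keep_fraction` parameter, and the assumption that a per-token conditional entropy estimate is already available are all hypothetical and not taken from the paper.

```python
from itertools import accumulate


def count_tokens_to_keep(cond_entropies, keep_fraction=0.9):
    """Return how many leading tokens to keep.

    cond_entropies: hypothetical per-token estimates of H(token_i | earlier tokens),
        in token order. How TaTok actually estimates these is not stated in the abstract.
    keep_fraction: assumed threshold; the prefix is cut once its cumulative
        entropy reaches this fraction of the total.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    total = sum(cond_entropies)
    if total <= 0:
        # No measurable information in the patch tokens: keep none of them.
        return 0

    target = keep_fraction * total
    # Walk the running sum and stop at the first prefix that meets the target.
    for count, cumulative in enumerate(accumulate(cond_entropies), start=1):
        if cumulative >= target:
            return count
    return len(cond_entropies)


# An information-dense image keeps more tokens than a nearly uniform one.
dense = count_tokens_to_keep([4.0, 2.0, 1.0, 1.0], keep_fraction=0.75)
flat = count_tokens_to_keep([0.5, 0.0, 0.0, 0.0], keep_fraction=0.9)
print(dense, flat)
```

Under this reading, the number of retained tokens varies with the information content of each image rather than being fixed in advance, which matches the abstract's description of allocating tokens according to information richness.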
