CV · Jan 6, 2025

CAT: Content-Adaptive Image Tokenization

Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou

arXiv:2501.03120v1 · 19.012 citations · h-index: 83

Originality: Incremental advance

AI Analysis

The work addresses inefficiencies in image tokenization for computer vision applications, though it is incremental, building on existing tokenization and diffusion methods.

The paper tackles fixed-length image tokenization by introducing CAT, a content-adaptive tokenizer that adjusts representation capacity to image complexity. It reports improved FID scores over fixed-ratio baselines and an 18.5% boost in inference throughput.

Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts inference throughput by 18.5%.
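
As a rough illustration of the idea in the abstract, the sketch below maps a content-complexity score to a compression ratio and counts the latent tokens that result. The score range, the thresholds, the candidate ratios (8, 16, 32), and the function names are assumptions made for illustration and are not taken from the paper; in CAT the score would come from the LLM-based caption evaluation.

```python
# Hypothetical sketch: pick a spatial compression ratio from a complexity
# score, then count the latent tokens for an image of a given size.
# Ratios and thresholds are illustrative assumptions, not the paper's values.

def choose_ratio(complexity: float, thresholds: tuple = (0.33, 0.66)) -> int:
    """Map a complexity score in [0, 1] to a compression ratio.

    Simpler images (low score) get a higher ratio, hence fewer tokens.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    low, high = thresholds
    if complexity < low:
        return 32
    if complexity < high:
        return 16
    return 8


def latent_tokens(height: int, width: int, ratio: int) -> int:
    """Number of latent positions when each side is downsampled by `ratio`."""
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return (height // ratio) * (width // ratio)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        ratio = choose_ratio(score)
        print(score, ratio, latent_tokens(256, 256, ratio))
```

Under these assumed settings, a 256×256 image with a low score is encoded into 64 latent positions, while a high-scoring one uses 1024, which is the sense in which simpler images receive fewer tokens.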
