CV | Jan 6, 2025

CAT: Content-Adaptive Image Tokenization

arXiv:2501.03120v1 | 9 citations | h-index: 39
Originality: Incremental advance
AI Analysis

This paper addresses inefficiencies in image tokenization for computer vision applications. The advance is incremental, since it builds on existing tokenization and diffusion methods.

The paper tackles fixed-length image tokenization by introducing CAT, a content-adaptive tokenizer that adjusts representation capacity to image complexity. Compared with fixed-ratio baselines, it improves FID scores and raises inference throughput by 18.5%.

Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts the inference throughput by 18.5%.
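To make the token-allocation idea concrete, here is a minimal sketch of how a complexity score could be mapped to a compression ratio and then to a token count. It is not the paper's implementation: the candidate ratios (8, 16, 32), the score thresholds, and the function names choose_ratio, token_count, and average_tokens are all assumptions chosen for illustration, and the complexity score stands in for whatever the caption-based LLM evaluation would produce.

```python
# Illustrative sketch only. The ratios, thresholds, and function names below
# are hypothetical; the abstract does not specify them.

CANDIDATE_RATIOS = (8, 16, 32)  # assumed per-side spatial downsampling factors


def choose_ratio(complexity: float, low: float = 3.0, high: float = 6.0) -> int:
    """Map a complexity score (higher means more complex) to a ratio.

    Complex images get the smallest ratio (most tokens); simple images get
    the largest ratio (fewest tokens). The thresholds are assumptions.
    """
    if complexity >= high:
        return CANDIDATE_RATIOS[0]
    if complexity >= low:
        return CANDIDATE_RATIOS[1]
    return CANDIDATE_RATIOS[2]


def token_count(height: int, width: int, ratio: int) -> int:
    """Number of latent tokens when each side is downsampled by `ratio`."""
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return (height // ratio) * (width // ratio)


def average_tokens(scores: list[float], height: int = 256, width: int = 256) -> float:
    """Mean token count over a batch of images with the given scores."""
    counts = [token_count(height, width, choose_ratio(s)) for s in scores]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    for score in (7.5, 4.0, 1.0):
        ratio = choose_ratio(score)
        print(score, ratio, token_count(256, 256, ratio))
```

Under these assumed settings, a 256x256 image yields 1024, 256, or 64 tokens depending on its score, so a batch dominated by simple images needs far fewer tokens on average than a fixed 16x tokenizer would spend on it. That is the kind of saving a downstream generator can turn into throughput.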

