CV · CL · Feb 22, 2024

Subobject-level Image Tokenization

arXiv:2402.14327v3 · 22 citations · h-index: 25 · ICML
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in image understanding: it makes image tokenization more efficient and aligns it more closely with human visual perception.

Patch-based image tokenization ignores visual morphology. The paper replaces it with subobject-level adaptive token segmentation, which yields faster convergence, better generalization, and fewer visual tokens in VLMs.

Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection -- a simple task that can be handled well by a compact model -- with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
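The idea of pairing a boundary map with watershed segmentation can be illustrated with a minimal sketch. This is not the EPOC implementation: the boundary map is assumed to come from some boundary detector (here it is a hand-written grid), and the function names, the 4-connectivity, and the `threshold` parameter are hypothetical choices made for illustration. The sketch seeds regions where boundary probability is low, then floods outward in order of increasing boundary strength, so every pixel ends up with a token label.

```python
import heapq
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity (assumption)


def label_seeds(boundary, threshold):
    """Label connected regions of low boundary probability as seeds."""
    h, w = len(boundary), len(boundary[0])
    labels = [[0] * w for _ in range(h)]  # 0 means "not yet assigned"
    count = 0
    for r in range(h):
        for c in range(w):
            if labels[r][c] or boundary[r][c] >= threshold:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first fill of one seed region
                y, x = queue.popleft()
                for dy, dx in NEIGHBORS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w and not labels[ny][nx]
                            and boundary[ny][nx] < threshold):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def watershed_tokens(boundary, threshold=0.5):
    """Assign every pixel to a segment by flooding from the seeds."""
    h, w = len(boundary), len(boundary[0])
    labels, count = label_seeds(boundary, threshold)
    if count == 0:
        # No pixel is below the threshold: seed the weakest boundary pixel.
        r, c = min(((r, c) for r in range(h) for c in range(w)),
                   key=lambda p: boundary[p[0]][p[1]])
        labels[r][c] = count = 1

    heap = []

    def push_neighbors(y, x, lab):
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not labels[ny][nx]:
                heapq.heappush(heap, (boundary[ny][nx], ny, nx, lab))

    for r in range(h):
        for c in range(w):
            if labels[r][c]:
                push_neighbors(r, c, labels[r][c])

    # Flood in order of increasing boundary strength; ridges fill last.
    while heap:
        _, y, x, lab = heapq.heappop(heap)
        if labels[y][x]:
            continue
        labels[y][x] = lab
        push_neighbors(y, x, lab)
    return labels, count


# Toy boundary map: a strong vertical edge separates two flat regions.
boundary = [[0.1, 0.1, 0.9, 0.1, 0.1] for _ in range(3)]
labels, count = watershed_tokens(boundary)
```

On this toy input the two flat regions become separate segments and the edge pixels are absorbed into a neighbouring segment, which mirrors the property the abstract highlights: the flood reaches every pixel, so none is left unsegmented.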

Code Implementations: 1 repo
