CV | CL | Feb 22, 2024

Subobject-level Image Tokenization

Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung

arXiv:2402.14327v3 | 18.622 citations | h-index: 25 | Has Code | ICML

Originality: Incremental advance

AI Analysis

The paper addresses a bottleneck in image understanding for computer vision: it makes image tokenization more efficient and aligns tokens more closely with human annotations of objects and parts.

Patch-based image tokenization ignores visual morphology. The paper tackles this by introducing subobject-level adaptive token segmentation, which yields faster convergence, better generalization, and fewer visual tokens in VLMs.

Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection -- a simple task that can be handled well by a compact model -- with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
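
The claim that watershed "inherently guarantees no pixels are left unsegmented" can be illustrated with a small sketch. This is not the EPOC implementation: it is a generic priority-flood watershed in pure Python, and the function name `watershed_from_boundaries`, the `threshold` parameter, and the hand-written boundary map (standing in for the output of a boundary-detection model) are all assumptions made for illustration.

```python
import heapq


def watershed_from_boundaries(boundary, threshold=0.5):
    """Priority-flood watershed on a 2D boundary-strength map (list of lists).

    Illustrative sketch only; the name and `threshold` are hypothetical.
    Returns a label map in which every pixel has a label >= 1.
    """
    h, w = len(boundary), len(boundary[0])
    labels = [[0] * w for _ in range(h)]  # 0 means "not yet segmented"
    heap = []
    next_label = 0

    def neighbors(y, x):
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                yield ny, nx

    # Step 1: seeds are 4-connected regions of low boundary strength.
    for y in range(h):
        for x in range(w):
            if boundary[y][x] < threshold and labels[y][x] == 0:
                next_label += 1
                labels[y][x] = next_label
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    heapq.heappush(heap, (boundary[cy][cx], cy, cx))
                    for ny, nx in neighbors(cy, cx):
                        if boundary[ny][nx] < threshold and labels[ny][nx] == 0:
                            labels[ny][nx] = next_label
                            stack.append((ny, nx))

    # Edge case: no pixel is below the threshold, so seed the weakest pixel.
    if not heap:
        y, x = min(((y, x) for y in range(h) for x in range(w)),
                   key=lambda p: boundary[p[0]][p[1]])
        labels[y][x] = 1
        heapq.heappush(heap, (boundary[y][x], y, x))

    # Step 2: flood outward, visiting weak boundaries first. Every pixel of
    # the grid is eventually reached, so none is left unlabeled.
    while heap:
        _, y, x = heapq.heappop(heap)
        for ny, nx in neighbors(y, x):
            if labels[ny][nx] == 0:
                labels[ny][nx] = labels[y][x]
                heapq.heappush(heap, (boundary[ny][nx], ny, nx))
    return labels


# Toy boundary map: a strong vertical ridge at column 3 separates two regions.
ridge = [[0.9 if x == 3 else 0.1 for x in range(7)] for _ in range(5)]
tokens = watershed_from_boundaries(ridge)
```

In this toy case the ridge pixels themselves are absorbed into one of the two neighboring segments rather than left as an unlabeled gap, which is the property the abstract relies on: each resulting segment can serve as one token, and the segments tile the whole image.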
