CV · Nov 4, 2025

Differentiable Hierarchical Visual Tokenization

arXiv:2511.02652v1 · 2 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a limitation of Vision Transformers in computer vision, namely fixed patch tokenization, with a tokenizer that can be retrofitted to existing models. It is an incremental improvement over existing tokenization methods.

Vision Transformers use fixed patch tokens that ignore image structure. The paper introduces a differentiable tokenizer that adapts to image content at pixel-level granularity, achieves competitive performance in classification and dense-prediction tasks, and supports raster-to-vector conversion.

Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
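For context, the fixed patch tokenization that the abstract contrasts against can be sketched in a few lines. This is a minimal illustration of the standard ViT-style baseline, not the paper's adaptive tokenizer; the function name `patchify`, the 224x224 input, and the patch size of 16 are assumptions chosen for the example.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch tokens.

    Hypothetical helper illustrating the fixed-grid baseline only.
    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one token vector, in row-major patch order.
    return grid.reshape(gh * gw, patch * patch * c)


# Assumed example input: a 224x224 RGB image with a patch size of 16.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768)
```

The grid is the same for every image: each token covers a fixed 16x16 square regardless of where object boundaries fall, which is the limitation the abstract describes.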

