CV · Jun 4, 2025

Images are Worth Variable Length of Representations

Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang

arXiv:2506.03643v28.44 citations · h-index: 3 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses an efficiency bottleneck in computer vision by letting the number of tokens used to represent an image vary with its content, though it is an incremental improvement over existing autoencoder-based methods.

The paper tackles the inefficiency of fixed-length token sequences in vision encoders by proposing DOVE, a dynamic vision encoder that produces a variable number of tokens per image. DOVE significantly reduces the average token count while maintaining high reconstruction quality, and it outperforms existing autoencoder-based tokenization methods on linear probing and downstream multimodal tasks.

Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.
