CV · Apr 5, 2025

Window Token Concatenation for Efficient Visual Large Language Models

Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong

arXiv:2504.04024v1 · 10.23 citations · h-index: 15 · Has Code · 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues in VLLMs for computer vision applications, representing an incremental improvement over prior token reduction techniques.

The paper tackles the problem of reducing visual tokens in Visual Large Language Models (VLLMs) to improve efficiency, proposing Window Token Concatenation (WiCo) and its enhanced version WiCo+, which achieve better performance on coarse- and fine-grained visual understanding tasks compared to existing token reduction methods.

To effectively reduce the visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging tokens within the same window to exhibit similar features. To further enhance performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. This design benefits from the large perception field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance than existing token reduction projectors. The code is available at https://github.com/JackYFL/WiCo.
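The window concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a row-major grid of visual tokens, non-overlapping windows (stride equal to the window size), and plain Python lists as feature vectors. The name `window_concat` and its parameters are hypothetical, and any learned projection, vision-encoder fine-tuning, or WiCo+ decomposition is omitted.

```python
from typing import List

Token = List[float]


def window_concat(tokens: List[Token], grid_h: int, grid_w: int, window: int) -> List[Token]:
    """Concatenate the tokens inside each non-overlapping window x window patch.

    Assumption: tokens are stored in row-major order over a grid_h x grid_w grid.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("tokens must contain grid_h * grid_w entries")
    if grid_h % window or grid_w % window:
        raise ValueError("grid size must be divisible by the window size")

    merged: List[Token] = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            combined: Token = []
            # Visit the window row by row and append each token's features.
            for dy in range(window):
                for dx in range(window):
                    combined.extend(tokens[(top + dy) * grid_w + (left + dx)])
            merged.append(combined)
    return merged


# Toy input: a 4 x 4 grid of 2-dimensional tokens, where token i is [i, i + 0.5].
grid = [[float(i), i + 0.5] for i in range(16)]
merged = window_concat(grid, grid_h=4, grid_w=4, window=2)
print(len(merged), len(merged[0]))
```

With a 2 x 2 window, the 16 input tokens become 4 tokens whose feature dimension grows from 2 to 8: the token count drops by the window area and the per-token dimension grows by the same factor. Because each output token mixes several inputs, dissimilar neighbours can blur fine details, which is the issue the abstract addresses by fine-tuning the last few vision-encoder layers.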
