CV | Jan 30

VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

arXiv:2601.22674v2 | 6 citations | h-index: 10 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses computational inefficiency for MLLM users, but the advance is incremental because it builds on existing token reduction methods.

The paper tackles the high computational cost that excessive visual tokens impose on multimodal large language models (MLLMs). It proposes VisionTrim, a unified framework for training-free acceleration, and reports superior performance across diverse image and video benchmarks.

Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
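The two-stage idea in the abstract (keep the dominant visual tokens, then merge the rest under text guidance) can be illustrated with a minimal sketch. This is not the authors' implementation: the importance scores, the cosine-similarity assignment, the relevance weighting, and the function names `select_dominant` and `text_guided_merge` are all assumptions made for illustration. The actual DVTS and TGVC modules are in the linked repository.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def select_dominant(scores, keep):
    """Return indices of the `keep` highest-scoring tokens, in original order.

    Assumption: `scores` is some per-token importance signal
    (for example, attention received from a global token).
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top)

def text_guided_merge(tokens, kept, text_vec):
    """Fold each dropped token into its most similar kept token.

    Assumption: a dropped token's contribution is weighted by its
    cosine similarity to a single pooled text embedding `text_vec`.
    """
    kept_set = set(kept)
    sums = {i: list(tokens[i]) for i in kept}   # running weighted sums
    weights = {i: 1.0 for i in kept}            # each kept token starts at weight 1
    for j, tok in enumerate(tokens):
        if j in kept_set:
            continue
        w = max(cosine(tok, text_vec), 0.0)     # text-irrelevant tokens add nothing
        if w == 0.0:
            continue
        target = max(kept, key=lambda i: cosine(tok, tokens[i]))
        sums[target] = [s + w * x for s, x in zip(sums[target], tok)]
        weights[target] += w
    return [[s / weights[i] for s in sums[i]] for i in kept]

# Toy data: four 2-D "visual tokens", hypothetical importance scores,
# and a pooled text embedding that points along the first axis.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.2, 0.8, 0.1]
text_vec = [1.0, 0.0]

kept = select_dominant(scores, keep=2)
compressed = text_guided_merge(tokens, kept, text_vec)
```

In this toy run the sequence shrinks from four tokens to two, and the dropped token that aligns with the text embedding influences its merge target far more than the one that does not. Both steps need no gradient updates, which is the sense in which such a pipeline is training-free.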

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
