CV AI CL | Jan 29, 2025

LFTR: Learning-Free Token Reduction for Multimodal Large Language Models

arXiv:2501.17391v36.21 citations | h-index: 1 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses efficiency issues in deploying MLLMs in resource-constrained environments, though it is incremental because it builds on existing token reduction ideas.

The paper tackles the high computational demands and slow inference times of Multimodal Large Language Models (MLLMs) by introducing LFTR, a learning-free token reduction method that reduces visual tokens by up to 16× while maintaining or improving performance on vision question-answering benchmarks.

Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. The vision modality typically contains more comprehensive information than the text modality, so its encoded representations comprise a large number of tokens, which incurs significant computational overhead due to the quadratic complexity of the attention mechanism. Current token reduction methods are typically tied to specific model architectures and often require extensive retraining or fine-tuning, which limits their applicability to many state-of-the-art models. In this paper, we introduce a learning-free token reduction (LFTR) method designed for MLLMs. LFTR can be seamlessly integrated into most open-source MLLM architectures without requiring additional fine-tuning. By capitalizing on the redundancy in visual representations, our approach effectively reduces tokens while preserving the general inference performance of MLLMs. We conduct experiments on multiple MLLM architectures (LLaVA, MiniGPT, QwenVL), and our results show that LFTR achieves up to a $16\times$ reduction of visual tokens while maintaining or even enhancing performance on mainstream vision question-answering benchmarks, all in a learning-free setting. Additionally, LFTR is complementary to other acceleration techniques, such as vision encoder compression and post-training quantization, further promoting the efficient deployment of MLLMs. Our project is available at https://anonymous.4open.science/r/LFTR-AAAI-0528.
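To make the quadratic-complexity argument concrete, the following is a rough back-of-the-envelope sketch, not the LFTR method itself. It counts the query-key pairs that full self-attention scores before and after a 16× cut in visual tokens. The token counts (576 visual, 64 text) and the function names are illustrative assumptions, not figures or identifiers from the paper, and the estimate ignores the non-attention parts of the model.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention over num_tokens."""
    return num_tokens * num_tokens


def attention_cost_ratio(visual_tokens: int, text_tokens: int, reduction: int) -> float:
    """Ratio of attention pairs before vs. after shrinking only the visual tokens."""
    reduced_visual = visual_tokens // reduction
    before = attention_pairs(visual_tokens + text_tokens)
    after = attention_pairs(reduced_visual + text_tokens)
    return before / after


# Illustrative numbers only (assumed, not taken from the paper):
# 576 visual tokens, 64 text tokens, 16x visual token reduction.
ratio = attention_cost_ratio(visual_tokens=576, text_tokens=64, reduction=16)
print(f"{ratio:.2f}")  # 640^2 / 100^2 = 40.96
```

Under these assumed counts the sequence shrinks from 640 to 100 tokens, so the pairwise attention work drops by roughly 41×, more than the 16× token reduction itself, because the cost grows with the square of the sequence length. The gain would be smaller for prompts dominated by text tokens.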
