Qiankun Ma

h-index: 3
2 papers

2 Papers

CV · Feb 23
ApET: Approximation-Error Guided Token Compression for Efficient VLMs

Qiankun Ma, Ziyao Zhang, Haofei Wang et al.

Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet their redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budget by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET integrates seamlessly with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.
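The two-step idea in the abstract (linear reconstruction from a few basis tokens, then ranking by approximation error) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `approx_error_prune`, the evenly spaced choice of basis tokens, the rule of always keeping the basis tokens, and the reading that a low reconstruction error marks a token as least informative are all assumptions made for illustration.

```python
import numpy as np

def approx_error_prune(tokens, num_basis, keep):
    """Return sorted indices of the tokens to keep (hypothetical helper).

    tokens: array of shape (N, D), one visual token per row.
    num_basis: number of basis tokens used for the linear approximation.
    keep: total number of tokens to retain (must be >= num_basis).
    """
    tokens = np.asarray(tokens, dtype=float)
    n = tokens.shape[0]
    if not num_basis <= keep <= n:
        raise ValueError("need num_basis <= keep <= number of tokens")

    # Assumption: basis tokens are picked at evenly spaced positions.
    basis_idx = np.linspace(0, n - 1, num_basis).round().astype(int)
    basis = tokens[basis_idx]                      # (K, D)

    # Least-squares coefficients so that basis.T @ coeffs approximates tokens.T.
    coeffs, *_ = np.linalg.lstsq(basis.T, tokens.T, rcond=None)  # (K, N)
    recon = (basis.T @ coeffs).T                   # (N, D)

    # Per-token approximation error: large error means the token is
    # poorly explained by the basis (assumed here to be more informative).
    errors = np.linalg.norm(tokens - recon, axis=1)

    # Assumption: basis tokens are always kept, so exclude them from ranking.
    errors[basis_idx] = -np.inf
    extra = np.argsort(-errors)[: keep - num_basis]
    return np.sort(np.concatenate([basis_idx, extra]))
```

Nothing in this sketch touches attention weights, which is the property the abstract credits for compatibility with FlashAttention.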

CV · Jul 18, 2025
Training-free Token Reduction for Vision Mamba

Qiankun Ma, Ziyao Zhang, Chi Su et al.

Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs) due to its ability to efficiently capture long-range dependencies with linear computational complexity. Yet token reduction, an effective compression technique in ViTs, has rarely been explored in Vision Mamba, even though improving Vision Mamba's efficiency is essential for enabling broader applications. We find that directly applying existing token reduction techniques for ViTs to Vision Mamba leads to significant performance degradation. This is primarily because Mamba is a sequence model without attention mechanisms, whereas most token reduction techniques for ViTs rely on attention mechanisms for importance measurement and overlook the order of compressed tokens. In this paper, we investigate a Mamba structure-aware importance score to evaluate token importance in a simple and effective manner. Building on this score, we further propose MTR, a training-free Mamba Token Reduction framework. Requiring no training or additional tuning parameters, our method can be seamlessly integrated as a plug-and-play component across various Mamba models. Extensive experiments demonstrate that our approach significantly reduces computational workload while minimizing performance impact across various tasks and multiple backbones. Notably, MTR reduces FLOPs by approximately 40% on the Vim-B backbone, with only a 1.6% drop in ImageNet performance without retraining.
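The point about token order can be illustrated with a short sketch of order-preserving pruning: keep the highest-scoring tokens, then restore their original sequence positions before passing them on to a sequence model. The abstract does not define the Mamba structure-aware importance score, so `scores` below is a placeholder input, and the function name `prune_preserving_order` is hypothetical. This is not the MTR implementation.

```python
def prune_preserving_order(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens in their original order.

    tokens: sequence of tokens (any objects).
    scores: one importance value per token (placeholder for the paper's score).
    Returns (kept_tokens, kept_indices).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 <= keep <= len(tokens):
        raise ValueError("keep must be between 0 and the number of tokens")

    # Rank positions by score, highest first, and take the top `keep`.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]

    # Restore sequence order so the scan sees tokens in their original order.
    kept_indices = sorted(top)
    return [tokens[i] for i in kept_indices], kept_indices
```

A method that returned the kept tokens in ranked order instead would scramble the sequence, which matters for an order-sensitive model in a way it does not for attention with explicit positional information.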