Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning
For researchers and practitioners in multimodal LLMs, this work offers an efficient token pruning method that maintains captioning quality while reducing computational cost.
The paper tackles the high computational cost of audio-visual captioning, which arises because self-attention scales quadratically with the number of input tokens. It proposes AVEX-Prune, an RL-based token pruning method with a token exchange strategy, which matches full-token quality at a 40% retention ratio (54.5 vs. 54.6 on VILA 1.5-8B, 57.0 vs. 56.8 on VideoLLaMA 2).
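As a rough illustration of why a 40% retention ratio matters under quadratic scaling, the sketch below counts the query-key pairs scored by full self-attention before and after pruning. It assumes attention cost is proportional to the square of the token count and ignores every linear-cost component (projections, MLP layers) as well as any text tokens that are not pruned; the token count of 1000 and both function names are hypothetical.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def relative_attention_cost(retention_ratio: float, num_tokens: int) -> float:
    """Attention cost after pruning, as a fraction of the unpruned cost."""
    kept = round(num_tokens * retention_ratio)
    return attention_pair_count(kept) / attention_pair_count(num_tokens)


full_tokens = 1000  # hypothetical audio-visual token count
print(relative_attention_cost(0.4, full_tokens))
```

Under these assumptions, keeping 40% of the tokens leaves about 16% of the pairwise attention work, since 0.4 squared is 0.16. The end-to-end saving in a real model would be smaller because the linear-cost components shrink only in proportion to the token count.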
Audio-visual captioning generates natural language descriptions from video and audio content. Multimodal LLMs have advanced this task, but both modalities contribute many tokens to the LLM input, where prefill self-attention scales quadratically. Existing token-pruning methods usually retain tokens by attention, saliency, or cross-entropy loss, yet their hard threshold selection makes it difficult to retain the tokens that are truly valuable, especially highly ambiguous tokens near the decision boundary. To this end, we propose AVEX-Prune, an RL-based audio-visual dynamic token pruning method. Its audio-visual token exchange strategy selects truly valuable tokens by replacing low-confidence retained tokens with high-confidence candidate tokens from the same or the other modality and measuring the resulting differences in caption generation. AVEX-Prune preserves full-token quality at a 40% retention ratio on both VILA 1.5-8B (54.5 vs. 54.6) and VideoLLaMA 2 (57.0 vs. 56.8).
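The exchange step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how confidence is computed, how many tokens are swapped, or how the caption difference feeds the RL objective. Here `scores` is a hypothetical per-token confidence list over the concatenated visual and audio tokens, `num_swaps` is a hypothetical swap budget, and `caption_score` is a placeholder for any caption-quality measure evaluated on a retained set.

```python
from typing import Callable, Sequence, Set


def top_k_retained(scores: Sequence[float], retention_ratio: float) -> Set[int]:
    """Indices of the highest-scoring tokens under a hard retention budget."""
    k = max(1, round(len(scores) * retention_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def exchange(scores: Sequence[float], retained: Set[int], num_swaps: int) -> Set[int]:
    """Swap the weakest retained tokens with the strongest pruned candidates."""
    pruned = [i for i in range(len(scores)) if i not in retained]
    weakest_kept = sorted(retained, key=lambda i: scores[i])[:num_swaps]
    strongest_pruned = sorted(pruned, key=lambda i: scores[i], reverse=True)[:num_swaps]
    n = min(len(weakest_kept), len(strongest_pruned))
    return (retained - set(weakest_kept[:n])) | set(strongest_pruned[:n])


def exchange_gain(
    scores: Sequence[float],
    retention_ratio: float,
    num_swaps: int,
    caption_score: Callable[[Set[int]], float],
) -> float:
    """Change in caption quality caused by the swap (positive means it helped)."""
    before = top_k_retained(scores, retention_ratio)
    after = exchange(scores, before, num_swaps)
    return caption_score(after) - caption_score(before)


# Toy data (hypothetical): indices 0-4 are visual tokens, 5-9 are audio tokens.
scores = [0.9, 0.8, 0.52, 0.1, 0.2, 0.7, 0.51, 0.3, 0.05, 0.15]
useful = {0, 1, 5, 6}  # tokens that actually help the caption in this toy setup


def toy_caption_score(kept: Set[int]) -> float:
    # Stand-in for a real caption metric: counts the useful tokens retained.
    return float(len(kept & useful))


retained = top_k_retained(scores, 0.4)
swapped = exchange(scores, retained, num_swaps=1)
gain = exchange_gain(scores, 0.4, 1, toy_caption_score)
print(sorted(retained), sorted(swapped), gain)
```

In this toy case the hard threshold keeps visual token 2 (score 0.52) and drops audio token 6 (score 0.51), although only the latter is useful. The cross-modal swap recovers it, and the positive gain is the kind of signal a learned pruning policy could be rewarded on.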