CV · AI · May 12

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

arXiv:2605.11605 · 85.7
Predicted impact: top 21% in CV · last 90 days · Originality: Incremental advance
AI Analysis

The paper addresses the computational overhead that limits real-world deployment of Omni-LLMs by reducing the number of input tokens without sacrificing performance.

ContextGuard is an inference-time token pruning framework for Omni-LLMs that preserves broad audio-visual context while removing cross-modal redundancy, achieving full-token-level performance on five of six benchmarks while pruning 55% of input tokens on Qwen2.5-Omni 7B.

Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
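The pruning idea described in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the use of a frame's mean token as its "coarse semantics", the cosine-similarity thresholds, and the use of distance from the frame mean as a proxy for "localized visual detail". The audio-to-visual predictor is represented only by its output array `predicted`, since the abstract does not describe it.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return num / den

def prune_video_tokens(video, predicted, redundancy_thresh=0.9, detail_keep=1):
    """Return a boolean keep mask of shape (T, N).

    video:     (T, N, D) video tokens, N tokens per frame.
    predicted: (T, D) coarse visual semantics predicted from audio
               (stand-in for the output of a hypothetical predictor).
    """
    T, N, _ = video.shape
    coarse = video.mean(axis=1)  # assumed proxy for per-frame coarse semantics
    recoverable = cosine_sim(coarse, predicted) >= redundancy_thresh
    keep = np.ones((T, N), dtype=bool)
    for t in np.flatnonzero(recoverable):
        # Coarse content looks recoverable from audio: drop the frame's tokens,
        # but retain the ones furthest from the frame mean as "local detail".
        keep[t] = False
        if detail_keep > 0:
            dist = np.linalg.norm(video[t] - coarse[t], axis=-1)
            keep[t, np.argsort(dist)[-detail_keep:]] = True
    return keep

def merge_temporal(video, keep, merge_thresh=0.95):
    """Average kept tokens that stay similar over time at the same position."""
    T, N, D = video.shape
    merged = []
    for n in range(N):
        group = None  # (running sum, count) of the current merge group
        for t in range(T):
            if not keep[t, n]:
                continue
            tok = video[t, n]
            if group is not None:
                mean = group[0] / group[1]
                if cosine_sim(mean[None], tok[None])[0] >= merge_thresh:
                    group = (group[0] + tok, group[1] + 1)
                    continue
                merged.append(mean)  # close the previous group
            group = (tok.copy(), 1)
        if group is not None:
            merged.append(group[0] / group[1])
    return np.stack(merged) if merged else np.empty((0, D))
```

In this sketch, `prune_video_tokens` produces the mask and `merge_temporal` then compresses the surviving tokens; the fraction pruned is `1 - keep.mean()` before merging. No LLM weights are touched, which mirrors the abstract's claim that only an independently trained predictor is needed, though the actual selection criteria in ContextGuard may differ substantially.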

