MM | AI | CV | SD | Mar 15

DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

arXiv:2603.15685 | 85.0 | 1 citations | h-index: 7 | Has Code
AI Analysis

DASH addresses efficiency bottlenecks for researchers and practitioners deploying omnimodal AI systems, though it appears to be an incremental improvement over existing compression methods.

The paper tackles expensive inference in omnimodal large language models, which stems from long multimodal token sequences. It proposes DASH, a training-free framework that aligns token compression with semantic structure and, according to the authors, achieves higher compression ratios while maintaining superior accuracy across multiple benchmarks.

Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH.
