SD · CL · Mar 14

Causal Tracing of Audio-Text Fusion in Large Audio Language Models

arXiv:2603.13768 · 32.71 citations · h-index: 8
Predicted impact: top 6% in SD · last 90 days · Originality: Synthesis-oriented
AI Analysis

This work addresses the unclear mechanisms of audio-text fusion in LALMs, providing insights for researchers in multimodal AI, though it is incremental as it adapts existing methods to a new domain.

The study investigated how large audio language models integrate acoustic and textual information by applying causal tracing to analyze internal information flow, revealing distinct fusion strategies and identifying the final sequence token as a key bottleneck for audio information retrieval.

Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehension. By conducting layer-wise and token-wise analyses across DeSTA, Qwen, and Voxtral, we evaluate the causal effects of individual hidden states. Layer-wise analysis identifies different fusion strategies, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen. Token-wise analysis shows that the final sequence token acts as an informational bottleneck where the network decisively retrieves relevant information from the audio. We also observe an attention-like query mechanism at intermediate token positions that triggers the model to pull task-relevant audio context. These findings provide a clear characterization of when and where multi-modal integration occurs within LALMs.
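As a rough illustration of the causal tracing procedure the abstract describes, the sketch below runs a clean pass, a pass with corrupted audio embeddings, and a series of restored passes that patch one clean hidden state back in at a time. Everything here is an assumption made for illustration: the toy residual network, the names `forward` and `causal_trace`, the choice of `AUDIO_POS`, and the Gaussian corruption are hypothetical stand-ins, not the paper's models or code.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, N_TOKENS, D, N_CLASSES = 4, 6, 8, 3
AUDIO_POS = [0, 1, 2]  # hypothetical: the first three positions hold audio tokens

# Toy stand-in for a transformer: random layer weights and a linear readout.
weights = [rng.normal(scale=0.5, size=(D, D)) for _ in range(N_LAYERS)]
w_out = rng.normal(size=(D, N_CLASSES))

# Causal mixing: each token averages over itself and all earlier tokens.
mix = np.tril(np.ones((N_TOKENS, N_TOKENS)))
mix /= mix.sum(axis=1, keepdims=True)


def forward(embeddings, patch=None):
    """Run the toy model; patch=(layer, token, vector) overwrites one hidden state."""
    h = embeddings.copy()
    states = []
    for layer in range(N_LAYERS):
        h = h + np.tanh(mix @ h @ weights[layer])  # residual update
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]  # restore a single clean hidden state
        states.append(h.copy())
    logits = h[-1] @ w_out  # the prediction is read from the final token only
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, states


def causal_trace(clean_emb, noise_scale=3.0):
    """Return a (layer, token) map of probability recovered by each restoration."""
    clean_probs, clean_states = forward(clean_emb)
    target = int(np.argmax(clean_probs))  # answer predicted on the clean input

    # Corrupted run: add Gaussian noise to the audio positions only.
    corrupted = clean_emb.copy()
    corrupted[AUDIO_POS] += rng.normal(
        scale=noise_scale, size=(len(AUDIO_POS), D)
    )
    corrupted_probs, _ = forward(corrupted)

    # Restored runs: patch one clean state at a time into the corrupted run.
    effects = np.zeros((N_LAYERS, N_TOKENS))
    for layer in range(N_LAYERS):
        for tok in range(N_TOKENS):
            restored_probs, _ = forward(
                corrupted, patch=(layer, tok, clean_states[layer][tok])
            )
            effects[layer, tok] = restored_probs[target] - corrupted_probs[target]
    return effects, clean_probs[target], corrupted_probs[target]


clean_emb = rng.normal(size=(N_TOKENS, D))
effects, p_clean, p_corrupt = causal_trace(clean_emb)
print(np.round(effects, 3))  # rows are layers, columns are token positions
```

Reading `effects` row by row corresponds to a layer-wise analysis, and column by column to a token-wise one. In this toy model the last row is nonzero only at the final token, because the readout uses that position alone; that is a property of the sketch's construction, not a reproduction of the paper's results.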

