SD · CL · MM · AS · Sep 23, 2025

Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models

arXiv:2509.18816v1 · 4 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in multi-modal models for audio processing, offering an efficient solution to improve audio reasoning capabilities.

The paper tackles audio-textual attention imbalance in Large Audio-Language Models, which causes suboptimal performance on audio reasoning tasks. It proposes MATA, a training-free method that dynamically increases attention to audio tokens. The method yields consistent performance gains and enables an open-source model to surpass the proprietary Gemini 2.0 Flash on the MMAR benchmark.

Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose MATA, a novel training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Specifically, MATA intervenes after the raw attention scores are computed, targeting only the last token in intermediate layers, without introducing additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
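
The abstract describes the intervention only at a high level, so the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes raw (pre-softmax) attention scores of shape `(heads, query_len, key_len)` with any causal masking already applied, and it raises the scores at audio key positions for the last query token only. The function name `boost_audio_attention`, the strength parameter `alpha`, and the additive `alpha * |score|` rule are all hypothetical choices made for illustration; the abstract does not specify the exact update rule or which intermediate layers are selected.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def boost_audio_attention(scores, audio_mask, alpha=0.5):
    """Raise raw attention scores on audio keys for the last query token.

    scores:     (heads, query_len, key_len) pre-softmax attention scores.
    audio_mask: (key_len,) boolean array, True where the key is an audio token.
    alpha:      hypothetical boost strength (an assumption, not from the paper).
    """
    out = scores.copy()
    last_row = out[:, -1, :]  # view into `out`, shape (heads, key_len)
    # Assumed rule: add alpha * |score| so audio scores can only increase.
    last_row[:, audio_mask] = (
        last_row[:, audio_mask] + alpha * np.abs(last_row[:, audio_mask])
    )
    return out

# Toy example: 2 heads, 4 query positions, 6 keys (first 3 keys are audio).
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 4, 6))
audio_mask = np.array([True, True, True, False, False, False])

adjusted = boost_audio_attention(scores, audio_mask, alpha=0.5)
weights_before = softmax(scores)[:, -1, :]
weights_after = softmax(adjusted)[:, -1, :]
audio_mass_before = weights_before[:, audio_mask].sum(axis=-1)
audio_mass_after = weights_after[:, audio_mask].sum(axis=-1)
```

Because only one row of the score matrix is modified before the usual softmax, a sketch like this adds no parameters and negligible extra computation, which is consistent with the training-free framing in the abstract. In a real model the adjustment would be applied inside the attention computation of selected layers rather than on a standalone array.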

