AS, AI, SD
Sep 1, 2025

AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions

arXiv:2509.01787v2
1 citation
h-index: 13
Originality: Highly original
AI Analysis

AHAMask provides a reliable task specification method for LALMs, addressing prompt sensitivity, a key bottleneck in audio AI applications.

The paper tackles the problem of prompt sensitivity in large audio language models (LALMs) by proposing AHAMask, which masks specific attention heads to trigger acoustic tasks without instructions, achieving comparable or better performance than instruction-based methods.

Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from prompt sensitivity, where different instructions with the same intention can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of an LALM to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, on both single and composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain "functional pathways" in their attention heads.
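
As a rough illustration of the idea in the abstract, the sketch below shows where a per-head mask could sit in a multi-head attention layer, and why the trainable parameter count equals the number of attention heads. It is not the authors' implementation. The function names, the toy layer and head counts, and the thresholding rule (keep a head when its logit is positive) are all assumptions made for illustration; the abstract does not say how the masks are parameterized or trained.

```python
def binarize(logits):
    """Turn one learnable logit per head into a hard 0/1 mask.

    Assumed rule (not from the paper): keep a head when its logit is positive.
    """
    return [1.0 if z > 0 else 0.0 for z in logits]


def apply_head_mask(head_outputs, mask):
    """Zero out the masked heads, then concatenate the heads as usual.

    head_outputs: per-head output vectors for one token in one layer
    mask:         one 0/1 value per head
    The concatenated result would then go through the layer's output projection.
    """
    if len(head_outputs) != len(mask):
        raise ValueError("need exactly one mask entry per head")
    merged = []
    for vec, keep in zip(head_outputs, mask):
        merged.extend(keep * v for v in vec)
    return merged


# Hypothetical toy backbone: 2 layers with 4 heads each.
n_layers, n_heads = 2, 4
mask_logits = [[0.0] * n_heads for _ in range(n_layers)]

# One trainable scalar per head, so the count is n_layers * n_heads.
n_trainable = sum(len(row) for row in mask_logits)
```

Under these assumptions, a task would be specified by loading a different set of mask logits rather than by writing an instruction, while the backbone weights stay frozen.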

