SDAICLMMDec 30, 2025

AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives

arXiv:2512.24052v25 citationsh-index: 10Has Code
Originality Incremental advance
AI Analysis

It addresses grounding failures in audio-language models for applications requiring temporal reasoning, with incremental improvements over existing methods.

The paper tackles hallucinations in Large Audio-Language Models by introducing the AHA framework, which uses counterfactual hard negative mining to align models, resulting in a 13.7% improvement on a diagnostic benchmark and gains on public benchmarks like 1.3% on MMAU-Test.

Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming latest SOTA methods. The model and dataset are open-sourced at https://github.com/LLM-VLM-GSL/AHA.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes