ASCLSDOct 21, 2024

Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

arXiv:2410.16130v233 citationsh-index: 12ICASSP
Originality Incremental advance
AI Analysis

This work addresses reliability issues in large audio-language models for real-world applications, representing an incremental improvement.

The paper tackled the problem of hallucinations in large audio-language models by evaluating them on three tasks: object existence, temporal order, and object attribute, revealing limitations in these areas. It introduced a multi-turn chain-of-thought approach that significantly improved model performance across these tasks.

Recent advancements in large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources, which undermine their reliability and real-world application. To systematically evaluate these issues, we propose three distinct tasks: object existence, temporal order, and object attribute within audio. These tasks assess the models' comprehension of critical audio information aspects. Our experimental results reveal limitations in these fundamental tasks, underscoring the need for better models in recognizing specific sound events, determining event sequences, and identifying sound sources. To improve performance in these areas, we introduce a multi-turn chain-of-thought approach, which demonstrates significantly improved model performance across the proposed tasks.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes