SD AI CL MM ASMay 12, 2025

Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami

arXiv:2505.07365v117.87 citationsh-index: 56

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of advancing audio understanding and reasoning for AI agents, but it is incremental as it builds on existing audio-language models and benchmarks.

The paper introduces Task 5 of the DCASE 2025 Challenge, an Audio Question Answering benchmark across multiple sound domains, and reports preliminary results showing strong variation in top-1 accuracy across models and subsets.

We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact about the world effectively.

View on arXiv PDF

Similar