SD, AI, CL, MM, AS · May 12, 2025

Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

arXiv:2505.07365v1 · 7 citations · h-index: 56
Originality: Synthesis-oriented
AI Analysis

This work aims to advance audio understanding and reasoning for AI agents. Its contribution is incremental, since it builds on existing audio-language models and benchmarks.

The paper introduces Task 5 of the DCASE 2025 Challenge, an Audio Question Answering benchmark across multiple sound domains, and reports preliminary results showing strong variation in top-1 accuracy across models and subsets.

We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact about the world effectively.
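The abstract names the evaluation protocol (top-1 accuracy with answer-shuffling robustness) without spelling out its details. As a rough illustration only, the sketch below scores a multiple-choice predictor over several random orderings of each question's answer options, so that a model which favors a fixed position is penalized. The function name `shuffled_top1_accuracy`, the `(question, options, answer)` item layout, the `predict` callback, and the number of shuffles are all assumptions made for this example; they are not taken from the challenge's official scoring code.

```python
import random
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical item layout: (question text, answer options, correct option text).
Item = Tuple[str, Sequence[str], str]
# Hypothetical model interface: returns the index of the chosen option.
Predictor = Callable[[str, Sequence[str]], int]


def shuffled_top1_accuracy(
    items: Iterable[Item],
    predict: Predictor,
    n_shuffles: int = 3,
    seed: int = 0,
) -> float:
    """Top-1 accuracy averaged over random permutations of the answer options."""
    rng = random.Random(seed)  # fixed seed keeps the evaluation reproducible
    correct = 0
    total = 0
    for question, options, answer in items:
        for _ in range(n_shuffles):
            # Present the same options in a fresh random order.
            shuffled = list(options)
            rng.shuffle(shuffled)
            choice = predict(question, shuffled)
            # Score by option content, not by position.
            correct += int(shuffled[choice] == answer)
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    demo_items = [
        ("Which animal produced this call?", ["dolphin", "humpback whale", "seal"], "humpback whale"),
        ("Which sound occurs first?", ["car horn", "dog bark"], "car horn"),
    ]

    # A toy predictor that always picks the first option, regardless of content.
    def always_first(question: str, options: Sequence[str]) -> int:
        return 0

    print(shuffled_top1_accuracy(demo_items, always_first))
```

A predictor that picks by content keeps the same score under any ordering, while a position-biased one drifts toward chance as the number of shuffles grows. That gap is the robustness signal this kind of protocol is meant to expose.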

