SD · AI · May 10

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

Yuhuan You, Lai Wei, Xihong Wu, Tianshu Qu

arXiv:2601.02954 · 13.63 citations · h-index: 11

Predicted impact: top 47% in SD · last 90 days · Originality: Incremental advance

AI Analysis

For researchers in audio-language models, this work provides a clear task hierarchy and a training framework for spatial reasoning, though the benchmark is synthetic and real-world validation is limited.

The paper formalizes spatial audio-language understanding as audio scene analysis (ASA) and proposes TWNM, a framework using First-Order Ambisonics simulation and slot-regularized spatial representations. On a controlled benchmark, TWNM achieves 70.8% overall accuracy and 79.76% on scene-level multiple-choice QA.

Large audio-language models have made rapid progress in recognizing what is present in an audio clip, but spatial audio-language understanding still lacks a clear task interface. A model must also decide where sound events occur, which semantic and spatial attributes belong to the same auditory object, how multiple objects are arranged, and whether a scene-level answer is physically plausible. We formalize this capability as audio scene analysis (ASA), a three-level problem spanning atomic perception, relational integration, and cognitive reasoning. We propose The World is Not Mono (TWNM), a framework that equips audio-language models with explicit spatial evidence. TWNM uses physically grounded First-Order Ambisonics (FOA) simulation for controllable supervision, learns slot-regularized spatial representations from multichannel audio, fuses them with semantic audio features, and trains with a progressive curriculum ending in preference optimization over metadata-derived answers and auxiliary format/evidence rewards. To operationalize ASA, we build a controlled benchmark from scene metadata, covering localization, attribute binding, spatial comparison, scene abduction, and counterfactual reasoning. On this benchmark, TWNM achieves 70.8% overall accuracy, 66.4% on spatial-family tasks, and 79.76% on mixed L3 scene-level multiple-choice QA. We also audit monaural and binaural reference systems as diagnostic references with explicit audit labels, since they differ in spatial input, training interface, and output format. The supported claim is that a clearly defined ASA hierarchy, FOA-conditioned spatial representations, and metadata-grounded training enable controlled, auditable spatial audio-language reasoning, with STARSS23 providing a limited real-recording diagnostic.
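To make concrete why First-Order Ambisonics gives a model explicit spatial evidence, here is a minimal sketch of encoding a mono source into four FOA channels and reading its direction back from them. It is not the paper's simulator: it assumes an anechoic plane-wave source, the common ACN channel order (W, Y, Z, X) with SN3D gains, and azimuth measured counterclockwise from the front. The function names `foa_encode` and `estimate_direction` are hypothetical and chosen only for illustration.

```python
import math


def foa_encode(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as a plane wave into FOA channels.

    Assumed convention: ACN order (W, Y, Z, X) with SN3D gains.
    Returns a list of four channels, each the same length as `mono`.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional
        math.sin(az) * math.cos(el),  # Y: left-right axis
        math.sin(el),                 # Z: up-down axis
        math.cos(az) * math.cos(el),  # X: front-back axis
    )
    return [[g * s for s in mono] for g in gains]


def estimate_direction(foa):
    """Estimate (azimuth, elevation) in degrees from a single-source FOA clip.

    Correlating W with each dipole channel yields a vector that points
    toward the source; this only holds for one dominant source.
    """
    w, y, z, x = foa
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


# A 440 Hz tone, 0.1 s at 16 kHz, placed 60 degrees to the left and 20 degrees up.
mono = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
foa = foa_encode(mono, azimuth_deg=60.0, elevation_deg=20.0)
az, el = estimate_direction(foa)
```

In this idealized setting the direction is recoverable from channel correlations alone, which a mono downmix (the W channel by itself) cannot provide. Reverberation and overlapping sources break this simple estimate, which is presumably part of why the paper learns spatial representations rather than relying on a closed-form rule.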
