SD · AS · Jun 4

Probing Spatial Structure in Pretrained Audio Representations

arXiv:2606.05544 · 13.6 · Has Code
Predicted impact top 30% in SD · last 90 days · Originality Incremental advance
AI Analysis

For researchers using pretrained audio representations, this work provides a benchmark and reveals systematic biases in spatial encoding.

The paper introduces the SARL benchmark to evaluate spatial information in pretrained audio models, finding that source factors are easier to decode than room factors and that input configuration and training paradigm shape spatial encoding.

Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results reveal systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.
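
The abstract describes probing frozen representations for source-level and room-level factors. As a rough illustration of what such a probe can look like, here is a minimal sketch using a closed-form ridge regression on synthetic embeddings. It is not the SARL code or API: the function names (`fit_ridge_probe`, `make_synthetic_embeddings`, `run_probe_demo`), the sin/cos encoding of azimuth, and the synthetic data are all assumptions made for this example. The synthetic embeddings encode RT60 more weakly than azimuth by construction, so the gap in the output is a property of the toy data, not a reproduction of the paper's results.

```python
import numpy as np


def fit_ridge_probe(X, Y, l2=1e-3):
    """Closed-form ridge probe: W = (Xb^T Xb + l2 * I)^-1 Xb^T Y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)


def predict(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W


def r2_score(y_true, y_pred):
    """Coefficient of determination, pooled over all output columns."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


def make_synthetic_embeddings(n=2000, dim=64, seed=0):
    """Hypothetical stand-in for frozen encoder outputs (not real model data)."""
    rng = np.random.default_rng(seed)
    azimuth = rng.uniform(-np.pi, np.pi, size=n)  # radians
    rt60 = rng.uniform(0.2, 1.5, size=n)          # seconds
    # Assumption: azimuth is strongly encoded, RT60 only weakly (scaled by 0.1).
    factors = np.stack(
        [np.sin(azimuth), np.cos(azimuth), 0.1 * (rt60 - rt60.mean())], axis=1
    )
    mixing = rng.normal(size=(3, dim))
    emb = factors @ mixing + 0.3 * rng.normal(size=(n, dim))
    return emb, azimuth, rt60


def run_probe_demo(n_train=1500):
    emb, azimuth, rt60 = make_synthetic_embeddings()
    X_tr, X_te = emb[:n_train], emb[n_train:]

    # Source-level probe: regress (sin, cos) of azimuth to avoid the wrap at +/- pi.
    az_targets = np.stack([np.sin(azimuth), np.cos(azimuth)], axis=1)
    W_az = fit_ridge_probe(X_tr, az_targets[:n_train])
    az_pred = predict(W_az, X_te)
    az_angle = np.arctan2(az_pred[:, 0], az_pred[:, 1])
    # Wrapped absolute angular error in degrees.
    diff = np.angle(np.exp(1j * (az_angle - azimuth[n_train:])))
    az_err_deg = float(np.degrees(np.mean(np.abs(diff))))

    # Room-level probe: regress RT60 directly.
    rt_targets = rt60[:, None]
    W_rt = fit_ridge_probe(X_tr, rt_targets[:n_train])
    rt_pred = predict(W_rt, X_te)

    return {
        "azimuth_r2": float(r2_score(az_targets[n_train:], az_pred)),
        "azimuth_mae_deg": az_err_deg,
        "rt60_r2": float(r2_score(rt_targets[n_train:], rt_pred)),
    }


if __name__ == "__main__":
    for name, value in run_probe_demo().items():
        print(f"{name}: {value:.3f}")
```

To probe a real encoder, one would replace `make_synthetic_embeddings` with embeddings extracted from the frozen model and the ground-truth factor labels from the dataset. Keeping the probe linear and the encoder frozen means the score reflects what is readily decodable from the representation rather than what a larger probe could learn on its own.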
