3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark
This framework provides interpretable failure-mode summaries and visual maps, which are crucial for developers optimizing audio novel-view synthesis models.
This paper introduces 3DAE, a diagnostic framework for evaluating binaural quality in audio novel-view synthesis. It generates time-frequency audio error maps for quantities such as magnitude, ILD, and IPD. Applied to ViGAS outputs on the Replay-NVAS and SoundSpaces datasets, the framework revealed distinct dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces.
3D audio and novel-view acoustic synthesis models are usually evaluated with global metrics. However, global metrics often hide where and why binaural prediction fails. We propose a full-reference diagnostic framework that uses time-frequency audio error maps for magnitude, interaural level difference (ILD), interaural phase difference (IPD), temporal alignment, loudness, and high-frequency failures, forming a 3D Audio Error Map (3DAE Map) for visual inspection. We package these diagnostics as a model-agnostic benchmark, Spatial Audio Error Bench (3DAE Bench), which takes arbitrary pairs of ground-truth and predicted binaural audio and reports the prediction quality of audio novel-view synthesis models. Experiments on ViGAS outputs over Replay-NVAS and SoundSpaces show different dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces. Overall, the framework provides interpretable failure-mode summaries and intuitive visual maps for developing and optimizing audio novel-view synthesis models.
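To make the idea of a full-reference time-frequency error map concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact definitions used in 3DAE, so the code assumes common textbook ones: ILD as the left-minus-right log-magnitude difference in dB, IPD as the phase of the left-right cross-spectrum, and temporal misalignment as the lag of the cross-correlation peak. The function names (`stft`, `to_db`, `error_maps`, `estimate_lag`) and the frame parameters (`n_fft=512`, `hop=128`) are hypothetical choices made for illustration.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT of a 1-D signal; returns shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


def to_db(spec, eps=1e-8):
    """Log-magnitude in dB; eps (an assumed floor) avoids log of zero."""
    return 20.0 * np.log10(np.abs(spec) + eps)


def error_maps(gt, pred, n_fft=512, hop=128):
    """Per-bin error maps for binaural pairs of shape (2, samples), left first.

    Assumed definitions (not taken from the paper):
      magnitude: mean absolute dB difference over the two channels
      ild:       absolute difference of left-minus-right level, in dB
      ipd:       absolute wrapped difference of interaural phase, in radians
    """
    gt_l, gt_r = stft(gt[0], n_fft, hop), stft(gt[1], n_fft, hop)
    pr_l, pr_r = stft(pred[0], n_fft, hop), stft(pred[1], n_fft, hop)

    magnitude = 0.5 * (
        np.abs(to_db(gt_l) - to_db(pr_l)) + np.abs(to_db(gt_r) - to_db(pr_r))
    )
    ild = np.abs((to_db(gt_l) - to_db(gt_r)) - (to_db(pr_l) - to_db(pr_r)))

    # Interaural phase from the cross-spectrum, then wrap the difference
    # into [-pi, pi] so that errors near +/-pi are not overstated.
    ipd_gt = np.angle(gt_l * np.conj(gt_r))
    ipd_pr = np.angle(pr_l * np.conj(pr_r))
    ipd = np.abs(np.angle(np.exp(1j * (ipd_gt - ipd_pr))))

    return {"magnitude_db": magnitude, "ild_db": ild, "ipd_rad": ipd}


def estimate_lag(gt_channel, pred_channel):
    """Lag in samples at the cross-correlation peak.

    A positive value means the prediction is delayed relative to ground truth.
    """
    xcorr = np.correlate(pred_channel, gt_channel, mode="full")
    return int(np.argmax(xcorr)) - (len(gt_channel) - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.standard_normal((2, 4096))   # synthetic stand-in for real audio
    pred = gt.copy()
    pred[1] *= 0.5                        # attenuate the right channel only
    maps = error_maps(gt, pred)
    print({name: m.shape for name, m in maps.items()})
    print("median ILD error (dB):", float(np.median(maps["ild_db"])))
```

Each returned map has one value per frequency bin and frame, so it can be rendered as an image next to the spectrogram to see where an error concentrates. In this synthetic example the right channel of the prediction is scaled but not shifted, so the mismatch appears in the ILD map while the IPD map stays near zero, which illustrates how separate maps can tell failure modes apart.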