Learning Multi-Target TDOA Features for Sound Event Localization and Detection
This work addresses sound event localization and detection (SELD) for audio systems that use microphone arrays. It is an incremental improvement that applies a novel method to a known bottleneck: the quality of the audio features fed to the system.
The paper proposes a new input feature, NGCC-PHAT, that learns TDOA features for multiple overlapping sound events. On the STARSS23 dataset, it reports improved localization performance compared to standard GCC-PHAT and SALSA-Lite input features.
Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
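For readers unfamiliar with the baseline feature, the following is a minimal sketch of classical GCC-PHAT for a single microphone pair, written with NumPy. It is not the proposed NGCC-PHAT and not the authors' implementation. The function names `gcc_phat` and `estimate_tdoa`, the parameters `max_lag`, `eps`, and `fs`, and the sign convention (a positive lag means the first signal arrives later) are assumptions made for illustration.

```python
import numpy as np


def gcc_phat(x1, x2, max_lag, eps=1e-8):
    """Classical GCC-PHAT between two 1-D signals (illustrative sketch).

    Returns (lags, cc), where lags runs from -max_lag to +max_lag in samples.
    Assumed convention: a peak at a positive lag means x1 is delayed
    relative to x2.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1) + len(x2)  # zero-pad so the correlation is linear, not circular
    if not 1 <= max_lag < n // 2:
        raise ValueError("max_lag must satisfy 1 <= max_lag < n // 2")

    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)

    # Phase transform: discard magnitude and keep only the phase.
    cross = cross / (np.abs(cross) + eps)

    cc = np.fft.irfft(cross, n=n)
    # Negative lags are stored at the end of the inverse FFT output.
    cc = np.concatenate((cc[-max_lag:], cc[: max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc


def estimate_tdoa(x1, x2, fs, max_lag):
    """Single-source TDOA estimate in seconds: the lag of the GCC-PHAT peak."""
    lags, cc = gcc_phat(x1, x2, max_lag)
    return lags[np.argmax(cc)] / fs
```

Taking the single largest peak, as `estimate_tdoa` does, yields only one TDOA per microphone pair. That limitation with overlapping sources is what motivates learning multi-target TDOA features with permutation invariant training.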