SALADnet: Self-Attentive multisource Localization in the Ambisonics Domain
This work addresses multi-speaker localization for audio processing applications, but its contribution is incremental: it modifies an existing CRNN-based method by substituting self-attention for the recurrent layers.
The paper tackled robust multi-speaker localization from Ambisonics recordings by proposing a neural network in which the recurrent layers of a state-of-the-art CRNN are replaced with self-attention encoders. The results showed that most of the proposed architectures performed on par with or outperformed the CRNN baseline, especially with multiple speakers, and that they ran faster because they lend themselves to parallel computing.
In this work, we propose a novel self-attention-based neural network for robust multi-speaker localization from Ambisonics recordings. Starting from a state-of-the-art convolutional recurrent neural network (CRNN), we investigate the benefit of replacing the recurrent layers with self-attention encoders, inherited from the Transformer architecture. We evaluate these models on synthetic and real-world data, with up to 3 simultaneous speakers. The obtained results indicate that the majority of the proposed architectures perform on par with, or outperform, the CRNN baseline, especially in the multisource scenario. Moreover, by avoiding the recurrent layers, the proposed models lend themselves to parallel computing, which is shown to produce considerable savings in execution time.
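To illustrate why avoiding recurrence opens the door to parallel computing, the following minimal sketch contrasts a toy recurrent pass with single-head scaled dot-product self-attention over a sequence of feature frames. It is not the SALADnet architecture: the function names `self_attention` and `recurrent_pass` are hypothetical, the learned query/key/value projections are replaced by the identity, and multi-head attention, positional encoding, and the feed-forward sublayers of a Transformer encoder are omitted. The toy recurrence is an assumption chosen for brevity and stands in for the recurrent layers of a CRNN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention (identity projections).

    frames: list of T feature vectors, each a list of d floats.
    Returns T output vectors; each one is a weighted average of all inputs.
    """
    d = len(frames[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    # Each iteration depends only on the inputs, never on another output,
    # so all T frames could be processed at the same time.
    for query in frames:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in frames]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        )
    return outputs


def recurrent_pass(frames, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t (assumption).

    Step t needs the state from step t-1, so the loop is inherently sequential.
    """
    state = [0.0] * len(frames[0])
    outputs = []
    for x in frames:
        state = [decay * h + (1 - decay) * xi for h, xi in zip(state, x)]
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(self_attention(frames))
    print(recurrent_pass(frames))
```

In `self_attention`, the loop over query frames has no dependency between iterations, which is the property that allows batched, parallel evaluation. In `recurrent_pass`, every step waits for the previous state.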