SD | AI | MM | RO | AS | Nov 30, 2017

Deep Neural Networks for Multiple Speaker Detection and Localization

arXiv:1711.11565v3 | 198 citations
Originality: Incremental advance
AI Analysis

This work provides an improved method for multiple speaker detection and localization, which is crucial for robust human-robot interaction in complex acoustic environments.

This paper addresses the problem of simultaneous detection and localization of multiple sound sources using deep neural networks. The authors propose a likelihood-based output encoding and sub-band cross-correlation features, demonstrating significant performance improvement over spatial spectrum-based approaches on real robot data.
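As a rough illustration of what sub-band cross-correlation features can look like, the sketch below computes a phase-normalized cross-correlation between two microphone signals separately in each frequency band. It is a minimal sketch under stated assumptions, not the authors' feature extractor: the PHAT weighting, the uniform band split, and the names `subband_gcc_phat`, `n_bands`, and `max_lag` are all choices made here for illustration.

```python
import numpy as np

def subband_gcc_phat(x1, x2, n_bands=4, max_lag=8):
    """Per-band phase-normalized cross-correlation of two equal-length frames.

    Returns an array of shape (n_bands, 2 * max_lag + 1). A positive peak lag
    means x2 is delayed relative to x1. Uniform bands are an assumption.
    """
    n = len(x1)
    X1 = np.fft.rfft(x1)
    X2 = np.fft.rfft(x2)
    cross = X2 * np.conj(X1)
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross = cross / np.maximum(np.abs(cross), 1e-12)

    lags = np.arange(-max_lag, max_lag + 1)
    bins = np.arange(len(cross))
    edges = np.linspace(0, len(cross), n_bands + 1).astype(int)

    feats = np.zeros((n_bands, len(lags)))
    for b in range(n_bands):
        k = bins[edges[b]:edges[b + 1]]
        # Inverse DFT restricted to band b, evaluated only at the chosen lags.
        phase = np.exp(2j * np.pi * np.outer(lags, k) / n)
        feats[b] = np.real(phase @ cross[edges[b]:edges[b + 1]]) / max(len(k), 1)
    return feats
```

Keeping the bands separate, rather than summing over all frequencies, lets a downstream network weight each band on its own, which is plausibly useful when overlapping sources dominate different parts of the spectrum.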

We propose to use neural networks for simultaneous detection and localization of multiple sound sources in human-robot interaction. In contrast to conventional signal processing techniques, neural network-based sound source localization methods require fewer strong assumptions about the environment. Previous neural network-based methods have focused on localizing a single sound source and do not extend to detecting and localizing multiple sources. We therefore propose a likelihood-based encoding of the network output, which naturally allows the detection of an arbitrary number of sources. In addition, we investigate the use of sub-band cross-correlation information as features for better localization in sound mixtures, as well as three network architectures, each based on a different motivation. Experiments on real data recorded from a robot show that our proposed methods significantly outperform the popular spatial spectrum-based approaches.
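One way to picture a likelihood-based output encoding is as a vector with one entry per candidate direction, where each active source contributes a bump centered on its direction of arrival. The sketch below is illustrative only and is not taken from the paper: the Gaussian-shaped bump, the 360-direction grid, the values of `sigma`, `threshold`, and `neighborhood`, and the function names are all assumptions.

```python
import numpy as np

def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)

def encode_likelihood(source_doas, n_dirs=360, sigma=8.0):
    """Encode any number of source azimuths (degrees) as a likelihood vector.

    Each source adds a Gaussian-shaped bump (assumed shape); overlapping
    bumps are merged with a maximum so values stay in [0, 1].
    """
    grid = np.arange(n_dirs) * (360.0 / n_dirs)
    target = np.zeros(n_dirs)
    for doa in source_doas:
        bump = np.exp(-angular_distance(grid, doa) ** 2 / sigma ** 2)
        target = np.maximum(target, bump)
    return target

def decode_peaks(likelihood, threshold=0.5, neighborhood=8):
    """Return directions (degrees) that are local maxima above a threshold."""
    n = len(likelihood)
    step = 360.0 / n
    peaks = []
    for i in range(n):
        if likelihood[i] <= threshold:
            continue
        # Circular window, so peaks near 0/360 degrees are handled correctly.
        window = [likelihood[(i + k) % n] for k in range(-neighborhood, neighborhood + 1)]
        if likelihood[i] >= max(window):
            peaks.append(i * step)
    return peaks
```

Because the output length is fixed and the number of peaks is not, the same vector can represent zero, one, or several sources: `decode_peaks(encode_likelihood([40, 200]))` recovers both directions, and an all-zero vector decodes to an empty list.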

Code Implementations: 1 repo
