SD, AS | Mar 17, 2020

High-Resolution Speaker Counting In Reverberant Rooms Using CRNN With Ambisonics Features

arXiv:2003.07839v1 | 18 citations
Originality: Incremental advance
AI Analysis

This paper addresses speaker counting, a building block for audio processing tasks such as diarization and separation. The advance is incremental because it builds on existing neural network approaches.

The paper tackles the problem of estimating the number of simultaneous speakers in audio recordings, achieving good accuracy at short-term frame resolution using a multichannel convolutional recurrent neural network trained on simulated data.

Speaker counting is the task of estimating the number of people who are speaking simultaneously in an audio recording. For several audio processing tasks, such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, and it also enables low-latency processing. For that purpose, we address the speaker counting problem with a multichannel convolutional recurrent neural network that produces an estimate at short-term frame resolution. We trained the network to predict up to 5 concurrent speakers in a multichannel mixture, using simulated data covering many different conditions in terms of source and microphone positions, reverberation, and noise. The network can predict the number of speakers with good accuracy at frame resolution.
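To make the frame-level target concrete, the following is a minimal sketch (not the authors' code) of how per-frame speaker-count labels could be derived from per-speaker activity intervals. The function name `frame_speaker_counts`, the interval format, the frame hop, and the choice to cap counts at the maximum are all assumptions made for illustration; only the limit of 5 concurrent speakers comes from the abstract.

```python
from typing import List, Tuple


def frame_speaker_counts(
    activity: List[List[Tuple[float, float]]],
    duration: float,
    hop: float = 0.01,      # assumed frame hop in seconds (hypothetical value)
    max_speakers: int = 5,  # the abstract states counts of up to 5 speakers
) -> List[int]:
    """Return the number of simultaneously active speakers for each frame.

    activity[s] is a list of (start, end) intervals, in seconds, during
    which speaker s is talking. Frames are indexed by their start time.
    """
    n_frames = int(round(duration / hop))
    counts = [0] * n_frames
    for intervals in activity:
        # Track activity per speaker first, so that overlapping intervals
        # of the same speaker are not counted twice.
        active = [False] * n_frames
        for start, end in intervals:
            first = max(0, int(round(start / hop)))
            last = min(n_frames, int(round(end / hop)))
            for i in range(first, last):
                active[i] = True
        for i, is_active in enumerate(active):
            if is_active:
                counts[i] += 1
    # Capping at max_speakers is an assumption of this sketch.
    return [min(c, max_speakers) for c in counts]


# Speaker A talks from 0.0 s to 0.3 s, speaker B from 0.2 s to 0.5 s.
labels = frame_speaker_counts([[(0.0, 0.3)], [(0.2, 0.5)]], duration=0.5, hop=0.1)
print(labels)  # [1, 1, 2, 1, 1]
```

A sequence like `labels` is the kind of per-frame target a counting network would be trained to predict, with one class per possible speaker count.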

