Multichannel CRNN for Speaker Counting: an Analysis of Performance
This work provides an incremental analysis for researchers developing speaker counting systems, particularly those using CRNN architectures, by identifying factors influencing prediction accuracy.
This paper analyzes a previously developed multichannel convolutional recurrent neural network (CRNN) for speaker counting, which estimates the number of simultaneous speakers in an audio recording at a short-term frame resolution. The authors empirically demonstrate that, for a given frame, there is an optimal position in the input sequence that yields the best prediction accuracy, and they link this optimal position to the input sequence length and the convolutional filter size.
Speaker counting is the task of estimating the number of people who are speaking simultaneously in an audio recording. For several audio processing tasks, such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, and it also enables low-latency processing. In previous work, we addressed the speaker counting problem with a multichannel convolutional recurrent neural network that produces an estimate at a short-term frame resolution. In this work, we show that, for a given frame, there is an optimal position in the input sequence that yields the best prediction accuracy. We empirically demonstrate the link between that optimal position, the length of the input sequence and the size of the convolutional filters.
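As an illustration of what "optimal position in the input sequence" can mean in practice, the following minimal sketch computes accuracy separately for each frame position and picks the best one. It is not the evaluation code of this work: the function names, the use of NumPy, and the array layout (one row per input sequence, one column per frame position, integer speaker counts as entries) are all assumptions made for the example.

```python
import numpy as np


def per_position_accuracy(pred_counts, true_counts):
    """Accuracy at each frame position, averaged over input sequences.

    Both arguments are assumed (hypothetical layout) to have shape
    (num_sequences, sequence_length) and to hold integer speaker counts.
    """
    pred = np.asarray(pred_counts)
    true = np.asarray(true_counts)
    if pred.ndim != 2 or pred.shape != true.shape:
        raise ValueError("expected two arrays of identical 2-D shape")
    # Compare element-wise, then average over sequences (axis 0),
    # leaving one accuracy value per position in the sequence.
    return (pred == true).mean(axis=0)


def best_position(pred_counts, true_counts):
    """Return (index of the most accurate position, per-position accuracies)."""
    accuracy = per_position_accuracy(pred_counts, true_counts)
    return int(np.argmax(accuracy)), accuracy
```

Repeating this measurement for several input sequence lengths and filter sizes would show how the most accurate position shifts, which is the kind of relationship the abstract describes.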