The Speed Submission to DIHARD II: Contributions & Lessons Learned
This work addresses speaker diarization in realistic, multi-domain settings for speech processing applications. Its contribution is incremental, since it builds on existing methods within a challenge framework.
The paper describes the Speed team's speaker diarization systems for the DIHARD II challenge, which considerably outperformed the challenge baselines, and analyzes how components such as speech enhancement and clustering affect performance.
This paper describes the speaker diarization systems developed by the Speed team for the Second DIHARD Speech Diarization Challenge (DIHARD II). Besides describing the systems, which considerably outperformed the challenge baselines, we focus on the lessons learned from the numerous approaches we tried for single- and multi-channel systems. We present the components of our diarization system: categorization of domains, speech enhancement, speech activity detection, speaker embeddings, clustering methods, resegmentation, and system fusion. We analyze and discuss the effect of each component on overall diarization performance within the realistic settings of the challenge.