SD · CL · AS · Sep 1, 2024

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

arXiv:2409.00819v1 · 7 citations · h-index: 29
Originality: Synthesis-oriented
AI Analysis

This dataset helps speech-processing researchers decode 'Who said What and When' in complex scenarios such as meetings, but the contribution is incremental because it builds on existing data generation methods.

The paper tackles the challenge of processing multi-talker speech in reverberant environments by introducing LibriheavyMix, a 20,000-hour dataset for single-channel speech separation, ASR, and speaker diarization. It also provides a benchmark pipeline covering all three tasks, and evaluations on the WHAMR! dataset indicate that the data is broadly applicable.

The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require specific information about microphone arrays. This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. This dataset is a critical resource for decoding "Who said What and When" in multi-talker, reverberant environments, a daunting challenge in the field. Additionally, we introduce a pipeline system encompassing speech separation, recognition, and diarization as a foundational benchmark. Evaluations on the WHAMR! dataset validate the broad applicability of the proposed data.

