AS · SD · Nov 3, 2020

Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

arXiv:2011.02014v1 · 117 citations
AI Analysis

This paper addresses the problem of transcribing multi-speaker meetings for applications such as automatic subtitle generation, but its contribution is incremental because it builds on existing pipeline components.

The paper tackles multi-speaker speech recognition in unsegmented recordings by proposing an end-to-end modular system that integrates speech separation, diarization, and recognition components, achieving a speaker-attributed word error rate of 12.7% on the LibriCSS meeting data.

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.
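The abstract reports downstream WER as the headline metric. As a rough illustration of what that number measures, here is a minimal sketch of plain word error rate: the word-level edit distance between a reference and a hypothesis transcript, divided by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example. It is not the paper's scoring tool, and it does not handle the speaker attribution that a speaker-attributed WER additionally requires.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain WER: (substitutions + deletions + insertions) / reference words.

    Illustrative sketch only; assumes whitespace tokenization and ignores
    speaker labels, so it is not a speaker-attributed WER.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.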
