AS, CL, LG | Sep 26, 2023

Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition

arXiv:2309.15796v1 | 7 citations | h-index: 63 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of reducing reliance on expensive, well-curated data for ASR training, though it is an incremental improvement over existing methods.

The paper tackles the problem of training automatic speech recognition (ASR) systems on flawed, non-verbatim transcripts by proposing Omni-temporal Classification (OTC), a novel training criterion that incorporates label uncertainties. The results show that OTC avoids performance degradation even with transcripts containing up to 70% errors, where conventional CTC models fail completely.

Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers. Through experiments conducted on the LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with OTC avoids performance degradation even with transcripts containing up to 70% errors, a scenario where CTC models fail completely. Our implementation is available at https://github.com/k2-fsa/icefall.
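For readers unfamiliar with the baseline that OTC extends, the following is a minimal sketch of the standard CTC objective, computed with the forward algorithm in plain Python. It is not the OTC criterion and not the icefall implementation; the function names (`logsumexp`, `ctc_neg_log_likelihood`) and the toy inputs are illustrative assumptions. The point it illustrates is that CTC sums probability only over alignments of the given transcript, so a wrong token in the transcript is scored as if it were correct.

```python
import math

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) for a handful of values."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """Standard CTC loss for one utterance (illustrative helper, not from icefall).

    log_probs: list of T frames, each a list of V log-probabilities.
    labels:    list of token ids for the transcript (no blanks).
    """
    # Extended label sequence with blanks interleaved: [b, l1, b, l2, ..., b]
    ext = [blank]
    for token in labels:
        ext += [token, blank]
    num_states, num_frames = len(ext), len(log_probs)

    # alpha[s] = log-probability of all partial alignments ending in state s
    alpha = [-math.inf] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [-math.inf] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]                 # stay in the same state
            if s >= 1:
                candidates.append(alpha[s - 1])     # advance by one state
            # Skip over a blank only between two different non-blank tokens
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = logsumexp(*candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the final blank or the final token
    total = logsumexp(alpha[-1], alpha[-2]) if num_states > 1 else alpha[-1]
    return -total

# Toy example (assumed values): 3 frames, vocabulary {0: blank, 1: "a", 2: "b"},
# with acoustics that strongly favor token 1 in every frame.
frames = [[math.log(0.1), math.log(0.8), math.log(0.1)]] * 3
loss_correct = ctc_neg_log_likelihood(frames, [1])   # transcript matches the audio
loss_flawed = ctc_neg_log_likelihood(frames, [2])    # transcript has a substitution error
print(loss_correct < loss_flawed)
```

In this toy setting the flawed transcript yields a much larger loss, and gradient descent on that loss would push the model toward the wrong token. The abstract's claim is that OTC relaxes this by encoding label uncertainty in the training graph via weighted finite state transducers.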

Code Implementations: 1 repo