AS CL LG Sep 26, 2023

Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition

Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel Povey, Sanjeev Khudanpur

arXiv:2309.15796v1, 6.67 citations, h-index: 62, Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of reducing reliance on expensive, well-curated data for ASR training, though it is an incremental improvement over existing methods.

The paper tackles the problem of training automatic speech recognition (ASR) systems on flawed, non-verbatim transcripts by proposing Omni-temporal Classification (OTC), a novel training criterion that incorporates label uncertainties. Experiments show that OTC avoids performance degradation even with transcripts containing up to 70% errors, where conventional CTC models fail completely.

Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers. Through experiments conducted on the LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with OTC avoids performance degradation even with transcripts containing up to 70% errors, a scenario where CTC models fail completely. Our implementation is available at https://github.com/k2-fsa/icefall.
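
To make the idea of relaxing CTC for uncertain labels concrete, here is a minimal pure-Python sketch of the standard CTC forward algorithm with one hypothetical extension: a `wildcard` label that is allowed to match any non-blank token. This is not the OTC criterion itself, which the abstract says is built from weighted finite state transducers; the function name `ctc_log_likelihood`, the `wildcard` parameter, and the toy inputs are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs, labels, blank=0, wildcard=None):
    """CTC forward algorithm in log space.

    log_probs: T frames (T >= 1), each a list of V log-probabilities.
    labels:    target token ids. Entries equal to `wildcard` (hypothetical,
               not part of standard CTC) match any non-blank token, so an
               uncertain position in the transcript is not forced onto one label.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)

    def emit(t, s):
        tok = ext[s]
        if wildcard is not None and tok == wildcard:
            # Sum the probability mass of every non-blank token at frame t.
            return logsumexp(*[lp for v, lp in enumerate(log_probs[t]) if v != blank])
        return log_probs[t][tok]

    alpha = [NEG_INF] * S
    alpha[0] = emit(0, 0)
    if S > 1:
        alpha[1] = emit(0, 1)

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]                      # stay on the same state
            if s >= 1:
                cands.append(alpha[s - 1])          # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])          # skip the blank in between
            new[s] = logsumexp(*cands) + emit(t, s)
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    return logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]

# Toy example: 2 frames, vocabulary {0: blank, 1: "a", 2: "b"}, uniform posteriors.
uniform = [[math.log(1 / 3)] * 3 for _ in range(2)]
print(round(math.exp(ctc_log_likelihood(uniform, [1])), 4))                # 0.3333
print(round(math.exp(ctc_log_likelihood(uniform, [-1], wildcard=-1)), 4))  # 0.8889
```

In the toy example the wildcard target accumulates the probability of every alignment containing at least one non-blank frame, so a possibly wrong label no longer penalizes the model for emitting a different token at that position. How OTC actually encodes and weights such uncertainty is specified in the paper and the linked icefall implementation, not in this sketch.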
