AS, CL, LG, SD | Jan 24, 2020

Semi-supervised ASR by End-to-end Self-training

arXiv:2001.09128v2 | 58 citations
AI Analysis

This work addresses the scarcity of labeled data in speech recognition. It is incremental, building on existing self-training and end-to-end ASR methods.

The paper tackles data sparsity in end-to-end automatic speech recognition with a self-training method that iteratively generates pseudo-labels on unsupervised data and uses them to augment supervised training. On the WSJ corpus, this yields a 14.4% relative WER improvement over the base system and closes 50% of the performance gap between the base system and an oracle system.
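The two reported figures can be related with a short calculation. This is only a sketch: the symbols W_base, W_self and W_oracle (the WER of the base, self-trained and oracle systems) are notation assumed here, not taken from the paper, and the derivation assumes both percentages are measured on the same test set.

```latex
% Assumed notation: W_base, W_self, W_oracle = WER of each system
\[
\frac{W_{\text{base}} - W_{\text{self}}}{W_{\text{base}}} = 0.144,
\qquad
\frac{W_{\text{base}} - W_{\text{self}}}{W_{\text{base}} - W_{\text{oracle}}} = 0.50
\]
% Dividing the first ratio by the second cancels the common numerator:
\[
\frac{W_{\text{base}} - W_{\text{oracle}}}{W_{\text{base}}}
  = \frac{0.144}{0.50} = 0.288
\]
```

Under these assumptions, the oracle system would sit about 28.8% (relative WER) below the base system, and self-training recovers half of that headroom.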

While deep learning based end-to-end automatic speech recognition (ASR) systems have greatly simplified modeling pipelines, they suffer from the data sparsity issue. In this work, we propose a self-training method with an end-to-end system for semi-supervised ASR. Starting from a Connectionist Temporal Classification (CTC) system trained on the supervised data, we iteratively generate pseudo-labels on a mini-batch of unsupervised utterances with the current model, and use the pseudo-labels to augment the supervised data for immediate model update. Our method retains the simplicity of end-to-end ASR systems, and can be seen as performing alternating optimization over a well-defined learning objective. We also perform empirical investigations of our method, regarding the effect of data augmentation, decoding beamsize for pseudo-label generation, and freshness of pseudo-labels. On a commonly used semi-supervised ASR setting with the WSJ corpus, our method gives 14.4% relative WER improvement over a carefully-trained base system with data augmentation, reducing the performance gap between the base system and the oracle system by 50%.
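The loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `decode` and `update` are hypothetical callables standing in for pseudo-label generation with the current model (the paper uses beam search; the decoder here is left abstract) and for one training update, and `ctc_collapse` is the standard CTC path-to-label mapping.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Features = Sequence[float]   # stand-in for one utterance's acoustic features
Labels = List[int]           # output label ids (e.g. characters)
Example = Tuple[Features, Labels]


def ctc_collapse(path: Sequence[int], blank: int = 0) -> Labels:
    """Map a frame-level CTC path to labels: merge repeats, then drop blanks."""
    out: Labels = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def self_train(
    decode: Callable[[Features], Labels],          # hypothetical: current model -> hypothesis
    update: Callable[[List[Example]], None],       # hypothetical: one model update on a batch
    supervised_batches: Iterable[List[Example]],
    unsupervised_batches: Iterable[List[Features]],
) -> int:
    """Alternate pseudo-labeling and model updates; return the number of updates."""
    n_updates = 0
    for sup_batch, unsup_batch in zip(supervised_batches, unsupervised_batches):
        # Pseudo-labels come from the *current* model, so `decode` is expected
        # to read the same parameters that `update` modifies.
        pseudo_batch = [(x, decode(x)) for x in unsup_batch]
        # Augment the supervised mini-batch and update immediately, which keeps
        # the pseudo-labels fresh for the next iteration.
        update(list(sup_batch) + pseudo_batch)
        n_updates += 1
    return n_updates
```

In a real system `decode` and `update` would wrap the same CTC model, so each update changes the pseudo-labels produced for the next mini-batch. How supervised and pseudo-labeled examples are weighted in the loss is not specified in the abstract and is left out of this sketch.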

