Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
This work addresses the inefficiency of iteratively updated pseudo-labels in semi-supervised ASR and offers a scalable approach for scenarios with limited labeled data or domain shift. The contribution is incremental, since it builds on existing methods such as mean teacher.
The paper proposes momentum pseudo-labeling (MPL) to address inefficient pseudo-labeling in semi-supervised speech recognition. MPL trains a pair of online and offline models that interact and learn from each other within a single training process. Experiments with varying amounts of data and with domain mismatch show improved ASR performance over the base model.
Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most previous approaches involve inefficient retraining of the model or intricate control of the label update. We present momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains a momentum-based moving average of the online model. MPL is performed in a single training process, and the interaction between the two models helps them reinforce each other to improve ASR performance. We apply MPL to an end-to-end ASR model based on connectionist temporal classification (CTC). The experimental results demonstrate that MPL improves over the base model and scales to different semi-supervised scenarios with varying amounts of data or domain mismatch.
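
The interaction between the two models can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the dictionary-of-floats parameter representation, the momentum coefficient `alpha`, the blank index, and the use of greedy CTC decoding to produce pseudo-labels are all assumptions made for illustration; the abstract only states that the offline model is a momentum-based moving average of the online model and that it generates pseudo-labels on the fly.

```python
from typing import Dict, List

BLANK = 0  # assumed index of the CTC blank token (hypothetical choice)


def momentum_update(
    offline: Dict[str, float], online: Dict[str, float], alpha: float
) -> Dict[str, float]:
    """Moving average: offline <- alpha * offline + (1 - alpha) * online."""
    return {
        name: alpha * offline[name] + (1.0 - alpha) * online[name]
        for name in offline
    }


def ctc_greedy_labels(frame_scores: List[List[float]]) -> List[int]:
    """Take the best token per frame, merge repeats, then drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    labels: List[int] = []
    prev = None
    for tok in best:
        if tok != prev and tok != BLANK:
            labels.append(tok)
        prev = tok
    return labels


# One illustrative step with toy values (not data from the paper).
offline_params = {"w": 1.0}
online_params = {"w": 0.0}

# Offline model output on one unlabeled utterance: 5 frames, 3 tokens.
offline_scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
pseudo_labels = ctc_greedy_labels(offline_scores)

# The online model would now take a gradient step toward pseudo_labels
# (omitted here); afterwards the offline model follows it slowly.
offline_params = momentum_update(offline_params, online_params, alpha=0.9)
```

With a value of `alpha` close to 1, the offline parameters change slowly, so the pseudo-labels evolve gradually with the online model without a separate retraining stage. In this sketch the value 0.9 is arbitrary.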