Token-Weighted RNN-T for Learning from Flawed Data
This work addresses accuracy degradation in ASR models caused by flawed training data, an incremental improvement for speech recognition applications.
The paper tackles the problem of training ASR models with flawed data, such as transcription errors in pseudo-labels or human annotations, by proposing a token-weighted RNN-T criterion to de-emphasize erroneous tokens. Results show up to 38% relative accuracy improvement in semi-supervised learning and recovery of 64%-99% of accuracy loss from transcription errors.
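To make the two reported figures concrete, here is a minimal sketch of how such numbers are typically computed. It assumes that "accuracy" is measured as word error rate (WER), that "relative improvement" means relative WER reduction, and that "recovery" means the share of the WER gap between clean-data and flawed-data training that the method closes. The summary does not state these definitions, so they are assumptions, as are the function names and the WER values, which are illustrative and not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer over baseline_wer (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def recovered_fraction(clean_wer: float, flawed_wer: float, mitigated_wer: float) -> float:
    """Share of the degradation (flawed_wer - clean_wer) that is undone (assumed definition)."""
    gap = flawed_wer - clean_wer
    if gap <= 0:
        raise ValueError("flawed_wer must exceed clean_wer")
    return (flawed_wer - mitigated_wer) / gap


# Illustrative WER values only, not taken from the paper.
rel = relative_improvement(baseline_wer=10.0, new_wer=6.2)
rec = recovered_fraction(clean_wer=5.0, flawed_wer=9.0, mitigated_wer=6.0)
print(f"relative improvement: {rel:.0%}, recovered: {rec:.0%}")
```

Under these assumed definitions, a drop from 10.0 to 6.2 WER corresponds to a 38% relative improvement, and closing three of four WER points of degradation corresponds to 75% recovery.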
ASR models are commonly trained with the cross-entropy criterion to increase the probability of a target token sequence. While optimizing the probability of all tokens in the target sequence is sensible, one may want to de-emphasize tokens that reflect transcription errors. In this work, we propose a novel token-weighted RNN-T criterion that augments the RNN-T objective with token-specific weights. The new objective is used for mitigating accuracy loss from transcription errors in the training data, which naturally appear in two settings: pseudo-labeling and human annotation errors. Experimental results show that using our method for semi-supervised learning with pseudo-labels leads to a consistent accuracy improvement, up to 38% relative. We also analyze the accuracy degradation resulting from different levels of WER in the reference transcription, and show that token-weighted RNN-T is suitable for overcoming this degradation, recovering 64%-99% of the accuracy loss.
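The idea of token-specific weights can be illustrated with a deliberately simplified sketch. It assumes the per-token log-probabilities of the target sequence are already available and applies one weight per token to a negative log-likelihood. It does not implement the RNN-T marginalization over alignments, and the abstract does not give the exact form of the criterion, so this is an illustration of the weighting idea rather than the paper's objective. The function name, probabilities, and weights are hypothetical.

```python
import math


def token_weighted_nll(token_log_probs, token_weights):
    """Weighted negative log-likelihood: -sum_i w_i * log p(y_i | ...).

    With all weights equal to 1.0 this reduces to the usual sequence NLL.
    """
    if len(token_log_probs) != len(token_weights):
        raise ValueError("need exactly one weight per target token")
    return -sum(w * lp for w, lp in zip(token_weights, token_log_probs))


# Hypothetical model probabilities for a 4-token transcript.
# The third token is assumed to be a suspected transcription error.
log_probs = [math.log(p) for p in (0.9, 0.8, 0.1, 0.7)]

uniform = token_weighted_nll(log_probs, [1.0, 1.0, 1.0, 1.0])
down_weighted = token_weighted_nll(log_probs, [1.0, 1.0, 0.2, 1.0])

print(f"uniform weights:       {uniform:.3f}")
print(f"suspect token at 0.2:  {down_weighted:.3f}")
```

Lowering the weight of the suspect token shrinks its contribution to the loss, so the model is pushed less strongly toward reproducing a likely error. In the pseudo-labeling setting, one plausible source of such weights would be a confidence score from the model that produced the labels, though the abstract does not specify how the weights are chosen.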