Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction
This work addresses the need for more accurate confidence estimation in ASR systems, which is crucial for downstream applications. The contribution is incremental, since it builds on existing neural network-based confidence estimators for end-to-end ASR.
The paper improves confidence estimation for automatic speech recognition by jointly learning word confidence, word deletion, and utterance confidence through multi-task learning. This improves the confidence metrics (NCE, AUC, RMSE) without increasing the size of the confidence estimation module, and rescoring with the utterance-level confidence reduces word error rates by 3-5% relative on Google's Voice Search and Long-tail Maps datasets.
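To make the joint objective concrete, here is a minimal sketch of how three such training signals could be combined into one loss. It is not the paper's formulation. The loss types (binary cross-entropy for word and utterance correctness, squared error on a per-word deletion count), the equal default weights, and all function and parameter names are assumptions chosen for illustration.

```python
import math


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one prediction, with clipping for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(word_probs, word_labels, del_preds, del_targets,
                   utt_prob, utt_label, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three per-utterance losses (hypothetical formulation).

    word_probs / word_labels: predicted P(word correct) and 0/1 targets.
    del_preds / del_targets: predicted and reference deletion counts per word
        (assumed here to be a regression target).
    utt_prob / utt_label: predicted P(utterance fully correct) and 0/1 target.
    """
    if not (len(word_probs) == len(word_labels)
            == len(del_preds) == len(del_targets)):
        raise ValueError("per-word inputs must have the same length")
    if not word_probs:
        raise ValueError("need at least one hypothesized word")

    n = len(word_probs)
    word_loss = sum(bce(p, y) for p, y in zip(word_probs, word_labels)) / n
    del_loss = sum((d - t) ** 2 for d, t in zip(del_preds, del_targets)) / n
    utt_loss = bce(utt_prob, utt_label)

    w_word, w_del, w_utt = weights
    return w_word * word_loss + w_del * del_loss + w_utt * utt_loss
```

Setting a weight to zero drops the corresponding task, which gives a simple way to compare single-task and multi-task training in this sketch.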
Confidence scores are very useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural networks to learn word or utterance confidence scores for end-to-end ASR. In those studies, word confidence by itself does not model deletions, and utterance confidence does not take advantage of word-level training signals. This paper proposes to jointly learn word confidence, word deletion, and utterance confidence. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without the need for increasing the model size of the confidence estimation module. Using the utterance-level confidence for rescoring also decreases the word error rates on Google's Voice Search and Long-tail Maps datasets by 3-5% relative, without needing a dedicated neural rescorer.
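Of the three metrics named above, NCE (normalized cross-entropy) is the least self-explanatory. The sketch below uses the standard definition: the relative reduction in cross-entropy that the confidence scores achieve over a baseline that always predicts the empirical fraction of correct words. The function name and the clipping constant are illustrative choices, not taken from the paper.

```python
import math


def normalized_cross_entropy(confidences, labels, eps=1e-7):
    """NCE = (H_prior - H_model) / H_prior.

    confidences: predicted probabilities that each word is correct.
    labels: 1 if the word is correct, 0 otherwise.
    A value near 1 means near-perfect confidences, 0 means no better than
    the constant prior, and negative values mean worse than the prior.
    """
    if len(confidences) != len(labels):
        raise ValueError("confidences and labels must have the same length")
    n = len(labels)
    n_correct = sum(labels)
    if n_correct == 0 or n_correct == n:
        raise ValueError("NCE is undefined when all labels are identical")

    # Cross-entropy of the baseline that always outputs the prior p_c.
    p_c = n_correct / n
    h_prior = -(n_correct * math.log(p_c)
                + (n - n_correct) * math.log(1.0 - p_c))

    # Cross-entropy of the actual confidence scores (clipped for stability).
    h_model = 0.0
    for conf, label in zip(confidences, labels):
        conf = min(max(conf, eps), 1.0 - eps)
        h_model -= math.log(conf) if label else math.log(1.0 - conf)

    return (h_prior - h_model) / h_prior
```

Unlike AUC, which depends only on how the scores rank correct against incorrect words, NCE also penalizes poorly calibrated probabilities, so the two metrics can disagree.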