On Modeling ASR Word Confidence
This work addresses how ASR errors degrade downstream applications that consume speech recognition output, though it appears incremental in that it builds on existing Word Confusion Networks.
The authors improve ASR word confidence estimation by introducing a Heterogeneous Word Confusion Network and a score calibration method, which yield a word sequence more accurate than the default 1-best result and make recognizer combination more reliable.
We present a new method for computing ASR word confidences that effectively mitigates the effect of ASR errors for diverse downstream applications, improves the word error rate of the 1-best result, and allows better comparison of scores across different models. We propose 1) a new method for modeling word confidence using a Heterogeneous Word Confusion Network (HWCN) that addresses some key flaws in conventional Word Confusion Networks, and 2) a new score calibration method for facilitating direct comparison of scores from different models. Using a bidirectional lattice recurrent neural network to compute a confidence score for each word in the HWCN, we show that the word sequence with the best overall confidence is more accurate than the default 1-best result of the recognizer, and that the calibration method can substantially improve the reliability of recognizer combination.
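To make the idea of "the word sequence with the best overall confidence" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the HWCN can be represented as a directed acyclic graph whose arcs carry a word and a confidence in (0, 1], that node IDs are integers in topological order, and that the overall confidence of a path is the product of its arc confidences. The abstract does not define the path score, so that choice is an assumption, as are the function name `best_confidence_path` and the toy confidences, which stand in for scores that a confidence model would supply.

```python
import math
from collections import defaultdict


def best_confidence_path(arcs, start, end):
    """Return (words, confidence) for the highest-confidence path.

    arcs: iterable of (src, dst, word, confidence) with src < dst, so that
    integer node IDs are already in topological order (an assumption).
    The path score is the product of arc confidences (also an assumption),
    accumulated in the log domain for numerical stability.
    """
    outgoing = defaultdict(list)
    for src, dst, word, conf in arcs:
        if not src < dst:
            raise ValueError("arcs must go from a lower to a higher node ID")
        outgoing[src].append((dst, word, conf))

    # best[node] = (log score of the best partial path, words on that path)
    best = {start: (0.0, [])}
    for node in range(start, end):
        if node not in best:
            continue  # node is unreachable from the start node
        score, words = best[node]
        for dst, word, conf in outgoing[node]:
            # Floor the confidence to avoid log(0).
            candidate = score + math.log(max(conf, 1e-12))
            if dst not in best or candidate > best[dst][0]:
                best[dst] = (candidate, words + [word])

    if end not in best:
        raise ValueError("end node is not reachable from start node")
    log_score, words = best[end]
    return words, math.exp(log_score)


# Toy network: arcs may span different node pairs, unlike the strictly
# aligned slots of a conventional word confusion network.
toy_arcs = [
    (0, 1, "i", 0.6),
    (1, 3, "scream", 0.5),
    (0, 2, "ice", 0.9),
    (2, 3, "cream", 0.9),
]
words, confidence = best_confidence_path(toy_arcs, start=0, end=3)
print(words, round(confidence, 2))
```

In this toy network the path "ice cream" (0.9 × 0.9) outscores "i scream" (0.6 × 0.5). A product of confidences favors paths with fewer arcs, so a real system might normalize by path length or use a different aggregate. The sketch illustrates only the search step, not the confidence model or the calibration method.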