AS · CL · IT · LG · Dec 16, 2022

Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition

arXiv:2212.08703v1 · 15 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses the need for reliable confidence estimation in speech recognition systems. It offers an incremental improvement over existing methods: computational cost comparable to the traditional approach, with greater adjustability.

The paper tackles word-level confidence estimation for end-to-end automatic speech recognition by introducing fast, non-trainable entropy-based methods that aggregate per-frame entropy for CTC and RNN-T models. On LibriSpeech test sets, these methods are up to 2 and 4 times better at detecting incorrect words than the traditional maximum per-frame probability method for Conformer-CTC and Conformer-RNN-T models, respectively.

This paper presents a class of new fast non-trainable entropy-based confidence estimation methods for automatic speech recognition. We show how per-frame entropy values can be normalized and aggregated to obtain a confidence measure per unit and per word for Connectionist Temporal Classification (CTC) and Recurrent Neural Network Transducer (RNN-T) models. Proposed methods have similar computational complexity to the traditional method based on the maximum per-frame probability, but they are more adjustable, have a wider effective threshold range, and better push apart the confidence distributions of correct and incorrect words. We evaluate the proposed confidence measures on LibriSpeech test sets, and show that they are up to 2 and 4 times better than confidence estimation based on the maximum per-frame probability at detecting incorrect words for Conformer-CTC and Conformer-RNN-T models, respectively.
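
The normalize-then-aggregate idea in the abstract can be illustrated with a minimal sketch. It assumes Shannon entropy, a linear normalization by ln(V) for a vocabulary of size V, and min or mean aggregation over the frames of a word; the paper's actual entropy variants, normalizations, and aggregation functions may differ. All function names and the toy posteriors below are hypothetical.

```python
import math


def max_prob_confidence(frame_probs):
    """Baseline: the confidence of one frame is its largest probability."""
    return max(frame_probs)


def entropy_confidence(frame_probs):
    """Entropy-based frame confidence (assumed form): 1 - H(p) / ln(V).

    Equals 1 for a one-hot distribution and 0 for a uniform one.
    """
    vocab_size = len(frame_probs)
    entropy = -sum(p * math.log(p) for p in frame_probs if p > 0.0)
    return 1.0 - entropy / math.log(vocab_size)


def aggregate(frame_confidences, how="min"):
    """Combine per-frame confidences into one score for a unit or word."""
    if how == "min":
        return min(frame_confidences)
    if how == "mean":
        return sum(frame_confidences) / len(frame_confidences)
    raise ValueError(f"unknown aggregation: {how}")


# Toy posteriors over a 4-symbol vocabulary for the frames of one word
# (illustrative values, not taken from the paper).
word_frames = [
    [0.97, 0.01, 0.01, 0.01],  # peaked frame
    [0.40, 0.30, 0.20, 0.10],  # uncertain frame
]

entropy_scores = [entropy_confidence(f) for f in word_frames]
max_prob_scores = [max_prob_confidence(f) for f in word_frames]

word_conf_entropy = aggregate(entropy_scores, how="min")
word_conf_max_prob = aggregate(max_prob_scores, how="min")
```

In this toy case the uncertain frame determines both word scores under min aggregation: the maximum-probability score is 0.4, while the entropy-based score is about 0.08, because entropy accounts for how the remaining probability mass is spread across the other symbols. Both measures need a single pass over each frame's distribution, which is consistent with the abstract's statement that their computational complexity is similar.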
