AS CL LG · Mar 11, 2021

Learning Word-Level Confidence For Subword End-to-End ASR

arXiv:2103.06716v1 · 37 citations
Originality: Incremental advance
AI Analysis

This work addresses the rare word recognition problem of on-device E2E ASR models by using the confidence estimate to select between the on-device model and a server-based hybrid model. The advance is incremental because it builds on prior confidence estimation methods.

The paper tackles inaccurate word-level confidence estimation in subword-based end-to-end ASR, which arises because the tokenization of a word into word-pieces is not unique. It proposes a self-attention model that learns word-level confidence directly, without needing subword tokenization, and reports improved NCE, AUC, and RMSE on Voice Search and long-tail test sets.
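For readers unfamiliar with NCE, the sketch below computes normalized cross entropy in its commonly used form: the relative reduction in cross entropy that the confidence scores achieve over a baseline that assigns every word the empirical correct rate. The function name and the toy inputs are illustrative assumptions, not taken from the paper, and the paper's exact evaluation setup may differ.

```python
import math


def normalized_cross_entropy(confidences, correct, eps=1e-12):
    """NCE = (H_prior - H_model) / H_prior.

    confidences: predicted probability that each word is correct.
    correct:     1 if the word is actually correct, else 0.
    (Hypothetical helper for illustration only.)
    """
    n = len(confidences)
    n_correct = sum(correct)
    if n == 0 or n_correct in (0, n):
        raise ValueError("need at least one correct and one incorrect word")

    # Entropy of the baseline that always predicts the average correct rate.
    p_c = n_correct / n
    h_prior = -(p_c * math.log(p_c) + (1 - p_c) * math.log(1 - p_c))

    # Cross entropy of the actual confidence scores against the labels.
    h_model = 0.0
    for p, c in zip(confidences, correct):
        h_model -= math.log(max(p, eps)) if c else math.log(max(1 - p, eps))
    h_model /= n

    return (h_prior - h_model) / h_prior


labels = [1, 1, 1, 0]
uninformative = normalized_cross_entropy([0.75] * 4, labels)
sharp = normalized_cross_entropy([0.99, 0.99, 0.99, 0.01], labels)
```

Under this definition a score near 1 indicates well-calibrated, informative confidences, a score of 0 means the scores are no better than the constant baseline, and negative values mean they are worse.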

We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
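The labeling problem described in the abstract can be illustrated with a small sketch. It assumes a word-piece convention in which a leading "_" marks the start of a word, and it uses Python's `difflib.SequenceMatcher` as a simple stand-in for the edit-distance alignment normally used to derive correctness labels. The example tokens and helper names are hypothetical and are not claimed to match the paper's implementation.

```python
from difflib import SequenceMatcher


def match_labels(hyp, ref):
    """Label each hypothesis token 1 if it aligns to an identical reference token."""
    labels = [0] * len(hyp)
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = 1
    return labels


def wp_to_words(pieces):
    """Join word-pieces into words; a leading '_' starts a new word (assumed convention)."""
    words = []
    for piece in pieces:
        if piece.startswith("_") or not words:
            words.append(piece[1:] if piece.startswith("_") else piece)
        else:
            words[-1] += piece
    return words


# The same two words, tokenized in two different but equally valid ways.
hyp_wp = ["_go", "od", "_mor", "ning"]
ref_wp = ["_go", "od", "_morn", "ing"]

# Labels at the word-piece level mark "morning" as wrong.
wp_labels = match_labels(hyp_wp, ref_wp)

# Labels at the word level mark both words as correct.
word_labels = match_labels(wp_to_words(hyp_wp), wp_to_words(ref_wp))
```

Here the hypothesis is entirely correct at the word level, yet the word-piece labels flag two of its four pieces as errors. A confidence model trained on such labels would learn from noisy targets, which is the motivation for estimating confidence at the word level directly.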

