AS, CL, SD | May 22, 2024

Contextualized Automatic Speech Recognition with Dynamic Vocabulary

NVIDIA
arXiv:2405.13344v2, 11 citations, h-index: 19, SLT
Originality: Incremental advance
AI Analysis

This work targets rare-word recognition in ASR systems and offers a more efficient solution that needs no additional modules. It is incremental, since it builds on existing deep biasing techniques.

The paper tackles naive sequence decomposition in deep biasing for automatic speech recognition, where splitting bias phrases into subwords lowers their occurrence probability. It proposes a dynamic vocabulary that adds each bias phrase as a single token during inference. The result is a 3.1 to 4.9 point improvement in bias phrase word error rate on English and Japanese datasets compared with conventional methods.

Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary. This naive sequence decomposition produces unnatural token patterns, significantly lowering their occurrence probability. More advanced techniques address this problem by expanding the vocabulary with additional modules, such as shallow fusion with an external language model or rescoring. However, these additional modules increase the workload. This paper proposes a dynamic vocabulary in which bias tokens can be added during inference. Each entry in a bias list is represented as a single token rather than as a sequence of existing subword tokens. This approach eliminates the need to learn subword dependencies within the bias phrases. The method is easily applied to various architectures because it only expands the embedding and output layers of common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the bias phrase WER on English and Japanese datasets by 3.1 to 4.9 points compared with the conventional DB method.
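
The idea of expanding only the embedding and output layers can be illustrated with a toy sketch. This is not the paper's implementation: the vocabulary, the sine-based stand-in embeddings, the averaging "bias encoder", and the sharing of one matrix between embedding and output layers are all assumptions made for illustration. It shows only that adding a bias phrase amounts to appending one row to the output matrix, so the phrase competes in the softmax as a single token.

```python
import math

# Toy static subword vocabulary (hypothetical; a real system would use learned subword units).
STATIC_VOCAB = ["<blank>", "_na", "ka", "mu", "ra", "_the"]
DIM = 4


def toy_embedding(index, dim=DIM):
    """Deterministic stand-in for a learned embedding vector (assumption)."""
    return [math.sin((index + 1) * (k + 1)) for k in range(dim)]


# One row per static token; reused here as the output-layer weights (assumption).
STATIC_EMB = [toy_embedding(i) for i in range(len(STATIC_VOCAB))]


def encode_bias_phrase(subwords):
    """Stand-in bias encoder: the mean of the phrase's subword embeddings.

    Averaging is used only to keep the sketch self-contained; it is not
    claimed to be the encoder used in the paper.
    """
    vecs = [STATIC_EMB[STATIC_VOCAB.index(s)] for s in subwords]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def expand_vocabulary(bias_list):
    """Return (tokens, matrix) with one extra token and row per bias phrase."""
    tokens = list(STATIC_VOCAB)
    matrix = [list(row) for row in STATIC_EMB]
    for phrase, subwords in bias_list:
        tokens.append(f"<{phrase}>")
        matrix.append(encode_bias_phrase(subwords))
    return tokens, matrix


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def output_distribution(hidden, matrix):
    """Dot-product logits over the expanded vocabulary, then softmax."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in matrix]
    return softmax(logits)


# A bias list supplied at inference time; the static vocabulary is left untouched.
bias_list = [("nakamura", ["_na", "ka", "mu", "ra"])]
tokens, matrix = expand_vocabulary(bias_list)
hidden = [0.1, -0.2, 0.3, 0.05]  # hypothetical decoder state
probs = output_distribution(hidden, matrix)
```

Because the expansion happens per call, a different bias list can be used for each utterance without retraining or modifying the static rows.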
