Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation
This work addresses two obstacles to recognizing rare words in ASR systems: limited training data and ambiguous pronunciations. It is an incremental improvement over existing contextual biasing methods.
The paper tackles the challenge of recognizing out-of-vocabulary words in contextual automatic speech recognition by proposing a zero-shot method that combines synthetic multi-pronunciation variants with trie-based decoding. On the LibriSpeech dataset, the method reduces biased-word error rate by 43% on test-clean and 44% on test-other while keeping the unbiased-word error rate essentially unchanged.
Contextual automatic speech recognition (ASR) systems aim to recognize out-of-vocabulary (OOV) words, such as named entities or rare words. However, recognizing such words remains challenging because training data is limited and pronunciations are often ambiguous or inconsistent. In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. Specifically, we use text-to-speech (TTS) systems to synthesize diverse speech samples containing each target rare word, and then use the pretrained Whisper model to extract multiple predicted pronunciation variants. These variant token sequences are compiled into a prefix trie, which assigns rewards to beam hypotheses in a shallow-fusion manner during beam-search decoding. Any recognized variant is then mapped back to the original rare word in the final transcription. Evaluation results on the LibriSpeech dataset show that our method reduces biased-word error rate (B-WER) by 43% on test-clean and 44% on test-other while leaving unbiased-word error rate (U-WER) essentially unchanged.
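The trie-based part of the pipeline can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class and function names (`BiasingTrie`, `bias_bonus`, `map_back`), the per-token reward value, the rule that a partial match loses its reward when it breaks, and the integer token IDs are all assumptions made for illustration. In practice the token sequences would come from the Whisper tokenizer applied to the variants decoded from TTS audio.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Union


@dataclass
class TrieNode:
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    word: Optional[str] = None  # original rare word, set where a variant ends


class BiasingTrie:
    """Prefix trie over token-ID sequences of pronunciation variants."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, variant_tokens: Sequence[int], word: str) -> None:
        node = self.root
        for tok in variant_tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.word = word


def bias_bonus(trie: BiasingTrie, tokens: Sequence[int], reward: float = 1.0) -> float:
    """Bonus added to a hypothesis' log-probability (shallow-fusion style).

    Assumed scheme: each token that extends a trie path earns `reward`;
    the reward for a partial match is dropped if the match breaks.
    """
    total, pending, node = 0.0, 0.0, trie.root
    for tok in tokens:
        nxt = node.children.get(tok)
        if nxt is None:
            pending = 0.0                      # partial match broke: revoke
            nxt = trie.root.children.get(tok)  # token may start a new match
        if nxt is None:
            node = trie.root
            continue
        pending += reward
        if nxt.word is not None:               # a full variant was matched
            total += pending
            pending = 0.0
        node = nxt if nxt.children else trie.root
    return total + pending                     # in-flight partial match still counts


def map_back(trie: BiasingTrie, tokens: Sequence[int]) -> List[Union[int, str]]:
    """Replace each matched variant (longest match first) with its rare word."""
    out: List[Union[int, str]] = []
    i = 0
    while i < len(tokens):
        node, j, match = trie.root, i, None
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.word is not None:
                match = (j, node.word)
        if match is not None:
            out.append(match[1])
            i = match[0]
        else:
            out.append(tokens[i])
            i += 1
    return out


def fused_score(log_prob: float, trie: BiasingTrie, tokens: Sequence[int]) -> float:
    """Score used to rank beam hypotheses: model log-probability plus trie bonus."""
    return log_prob + bias_bonus(trie, tokens)
```

In this sketch, two variants of the same word share a trie prefix, so a beam hypothesis is rewarded as soon as it starts down either path, and both variants resolve to the same surface form after decoding. The reward magnitude would need tuning against U-WER, since an overly large bonus can pull unrelated hypotheses toward the biasing list.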