Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
The work addresses the need for faster and more accurate rare-word recognition in contextualized ASR systems, though the contribution appears incremental because it builds on existing CTC and Transducer models.
The paper tackles the problem of slow context-biasing for rare and new words in ASR systems by introducing a CTC-based Word Spotter (CTC-WS). The spotter matches CTC log-probabilities against a context graph to detect candidates, and valid candidates then replace their greedy-decoding counterparts in the corresponding frame intervals. The reported results show significantly faster context-biasing recognition together with improved F-score and WER compared to the baselines.
Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in corresponding frame intervals. A Hybrid Transducer-CTC model enables the CTC-WS application for the Transducer model. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.
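The spot-and-replace idea described above can be illustrated with a minimal sketch. It is not the NeMo implementation: it matches a single context word instead of a full context graph, scores a match by the plain sum of log-probabilities, and accepts it against a fixed threshold. The names `spot_word`, `apply_candidates`, and `threshold`, the toy vocabulary, and the frame values are all assumptions made for illustration.

```python
import math

BLANK = 0  # assumed id of the CTC blank token

def spot_word(logprobs, tokens, blank=BLANK):
    """Best CTC alignment of `tokens` to any contiguous frame interval.

    Returns (score, start_frame, end_frame), or None if no alignment exists.
    """
    # Extended CTC sequence: the word's tokens with optional blanks in between.
    ext = []
    for i, tok in enumerate(tokens):
        if i > 0:
            ext.append(blank)
        ext.append(tok)
    n = len(ext)
    neg = -math.inf
    prev = [(neg, -1)] * n  # (score, start_frame) of the best path per state
    best = None
    for t, frame in enumerate(logprobs):
        cur = []
        for s in range(n):
            options = [prev[s]]                  # stay in the same state
            if s >= 1:
                options.append(prev[s - 1])      # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                options.append(prev[s - 2])      # skip the optional blank
            if s == 0:
                options.append((0.0, t))         # a match may start at any frame
            score, start = max(options, key=lambda o: o[0])
            cur.append((score + frame[ext[s]], start) if score > neg else (neg, -1))
        prev = cur
        end_score, end_start = prev[-1]          # paths that finished the word
        if end_score > neg and (best is None or end_score > best[0]):
            best = (end_score, end_start, t)
    return best

def apply_candidates(greedy_words, candidates):
    """Drop greedy words overlapping a candidate interval, then add the candidates."""
    kept = [
        (word, start, end)
        for word, start, end in greedy_words
        if not any(cs <= end and start <= ce for _, cs, ce in candidates)
    ]
    merged = sorted(kept + candidates, key=lambda w: w[1])
    return [word for word, _, _ in merged]

# Toy vocabulary (assumed): 0=<blank>, 1="ne", 2="mo", 3="ma", 4="hi"
probs = [
    [0.05, 0.05, 0.05, 0.05, 0.80],
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.40, 0.45, 0.05],  # greedy decoding picks "ma" over "mo" here
    [0.80, 0.05, 0.05, 0.05, 0.05],
]
logprobs = [[math.log(p) for p in frame] for frame in probs]

greedy_words = [("hi", 0, 0), ("nema", 2, 3)]  # assumed greedy output with frame spans
threshold = math.log(0.1)                      # hypothetical acceptance threshold

candidates = []
match = spot_word(logprobs, [1, 2])            # context word "nemo" = "ne" + "mo"
if match is not None and match[0] > threshold:
    candidates.append(("nemo", match[1], match[2]))

print(apply_candidates(greedy_words, candidates))
```

Because the spotter reads only the CTC log-probabilities and the greedy output, the ASR model and its decoder stay unchanged, which is the property the abstract highlights. One caveat of this sketch: scoring by the raw sum of log-probabilities favors the shortest possible interval, so a practical spotter would need a different scoring or pruning rule.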