TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree
This work addresses the need for efficient, versatile context-biasing in ASR systems and offers a practical solution for applications that must recognize many key phrases, though it appears incremental because it builds on existing shallow fusion methods.
The paper tackles the recognition of specific key phrases in contextualized Automatic Speech Recognition (ASR). It proposes a universal framework that supports CTC, Transducer, and Attention Encoder-Decoder models, handles up to 20K key phrases without noticeable speed degradation, and surpasses the considered open-source approaches in accuracy and decoding speed.
Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches require additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major model types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The results show that the proposed method is highly efficient, surpassing the considered open-source context-biasing approaches in both accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit.
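To make the idea of shallow fusion with a boosting tree concrete, the following is a minimal CPU-only sketch, not the paper's GPU implementation and not the NeMo API. All names (`BoostingTree`, `greedy_decode`, `bonus`, `weight`) are hypothetical. It assumes key phrases are already tokenized into integer ids, that the decoder emits one token per step, and it ignores CTC blank handling, failure links, and revoking bonuses for unfinished phrases, all of which a real system would likely need.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps a token id to the child node reached by emitting that token.
    children: dict = field(default_factory=dict)
    is_end: bool = False  # True if a key phrase ends at this node


class BoostingTree:
    """Hypothetical prefix tree over tokenized key phrases."""

    def __init__(self, phrases, bonus=1.0):
        self.root = Node()
        self.bonus = bonus
        for tokens in phrases:
            node = self.root
            for tok in tokens:
                node = node.children.setdefault(tok, Node())
            node.is_end = True

    def step(self, node, token):
        """Return (next_node, boost) for emitting `token` from `node`."""
        nxt = node.children.get(token)
        if nxt is None:
            # Simplification: restart from the root instead of following
            # Aho-Corasick-style failure links.
            nxt = self.root.children.get(token)
        if nxt is None:
            return self.root, 0.0
        return nxt, self.bonus


def greedy_decode(log_probs, tree, weight=1.0):
    """Greedy decoding with shallow fusion: score = log p + weight * boost."""
    node = tree.root
    output = []
    for frame in log_probs:  # one list of per-token log-probabilities per step
        best_tok, best_score, best_node = None, float("-inf"), tree.root
        for tok, lp in enumerate(frame):
            nxt, boost = tree.step(node, tok)
            score = lp + weight * boost
            if score > best_score:
                best_tok, best_score, best_node = tok, score, nxt
        output.append(best_tok)
        node = best_node
    return output


# Toy example: vocabulary {0, 1, 2}, one key phrase made of tokens [1, 2].
tree = BoostingTree([[1, 2]], bonus=1.0)
frames = [[-1.0, -1.2, -3.0], [-0.5, -2.0, -1.0]]
print(greedy_decode(frames, tree, weight=1.0))  # [1, 2]
print(greedy_decode(frames, tree, weight=0.0))  # [0, 0]
```

With `weight=0.0` the decoder follows the acoustic scores alone and misses the phrase; with `weight=1.0` the bonus tips both steps toward the key phrase. The sketch only illustrates the scoring rule; how the paper lays the tree out so that lookups run in parallel on a GPU is not reproduced here.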