CL LG AS

Nov 5, 2019

RNN-T For Latency Controlled ASR With Improved Beam Search

Mahaveer Jain, Kjell Schubert, Jay Mahadeokar, Ching-Feng Yeh, Kaustubh Kalgaonkar, Anuroop Sriram, Christian Fuegen, Michael L. Seltzer

arXiv:1911.01629v24.647 citations

Originality: Incremental advance

AI Analysis

This work addresses latency constraints in speech recognition applications. It is an incremental advance: it builds on existing RNN-T methods and adds specific optimizations.

The authors tackle latency-controlled automatic speech recognition by adapting RNN Transducers to a tunable latency budget and speeding up their beam search decoding. On an English video dataset, the resulting models achieve word error rates comparable to a hybrid baseline with better computational efficiency.

Neural transducer-based systems such as RNN Transducers (RNN-T) for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR system (acoustic model, language model, punctuation model, inverse text normalization) into a single model. This greatly simplifies training and inference and hence makes RNN-T a desirable choice for ASR systems. In this work, we investigate the use of RNN-T in applications that require a tunable latency budget at inference time. We also improve the decoding speed of the originally proposed RNN-T beam search algorithm. We evaluate our proposed system on an English video ASR dataset and show that neural RNN-T models can achieve comparable WER and better computational efficiency than a well-tuned hybrid ASR baseline.

