AS LG SD · Nov 4, 2022

Multi-blank Transducers for Speech Recognition

Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

arXiv:2211.03541v29.213 citations · h-index: 83 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses inference speed bottlenecks in speech recognition systems, offering significant practical improvements for deployment, though it is incremental as it modifies an existing model paradigm.

This paper tackles slow inference in RNN-Transducer models for automatic speech recognition by introducing multi-blank symbols, each of which consumes multiple input frames. The reported relative inference speedups are over 90% on English Librispeech and 139% on German Multilingual Librispeech, with accuracy also improving.

This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (https://github.com/NVIDIA/NeMo) toolkit.
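
To make the frame-skipping idea concrete, here is a minimal sketch of greedy decoding with big blanks. It is not the authors' NeMo implementation: the symbol names (`<b>`, `<b2>`, `<b4>`), the durations, the `best_symbol` callback standing in for the joint network's argmax, and the `max_symbols_per_frame` cap are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical blank inventory: each blank symbol maps to the number of
# input frames it consumes. "<b>" is the standard blank; the rest are big blanks.
BLANK_DURATIONS: Dict[str, int] = {"<b>": 1, "<b2>": 2, "<b4>": 4}


def greedy_decode(
    num_frames: int,
    best_symbol: Callable[[int, Tuple[str, ...]], str],
    max_symbols_per_frame: int = 5,
) -> Tuple[List[str], int]:
    """Greedy transducer decoding where blanks may skip several frames.

    best_symbol(t, history) is a stand-in (assumption) for taking the argmax
    of the joint network output at frame t given the emitted history.
    Returns the hypothesis and the number of decoding steps taken.
    """
    t = 0
    hyp: List[str] = []
    steps = 0
    emitted_at_t = 0
    while t < num_frames:
        sym = best_symbol(t, tuple(hyp))
        steps += 1
        if sym in BLANK_DURATIONS:
            # A blank advances time by its duration: 1 for the standard
            # blank, 2 or more for a big blank.
            t += BLANK_DURATIONS[sym]
            emitted_at_t = 0
        else:
            # A non-blank symbol extends the hypothesis without consuming frames.
            hyp.append(sym)
            emitted_at_t += 1
            if emitted_at_t >= max_symbols_per_frame:
                # Safety cap so a degenerate model cannot loop forever.
                t += 1
                emitted_at_t = 0
    return hyp, steps


def toy_policy(blank: str) -> Callable[[int, Tuple[str, ...]], str]:
    """Scripted toy model: emit 'a' at frame 0, 'b' at frame 4, else `blank`."""
    def best_symbol(t: int, history: Tuple[str, ...]) -> str:
        if t == 0 and len(history) == 0:
            return "a"
        if t == 4 and len(history) == 1:
            return "b"
        return blank
    return best_symbol


standard_hyp, standard_steps = greedy_decode(8, toy_policy("<b>"))
multi_hyp, multi_steps = greedy_decode(8, toy_policy("<b4>"))
print(standard_hyp, standard_steps)
print(multi_hyp, multi_steps)
```

In this toy setup both decoders produce the same hypothesis, but the one that emits a big blank needs fewer decoding steps, which is the mechanism behind the reported speedups. The sketch covers inference only and does not model the logit under-normalization used in training.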
