CL AS Sep 26, 2019

Improving RNN Transducer Modeling for End-to-End Speech Recognition

arXiv:1909.12415v1 179 citations

Originality: Incremental advance

AI Analysis

This work addresses efficient and accurate speech recognition for practical applications, offering incremental improvements in model size and training speed.

The paper improves RNN Transducer (RNN-T) models for end-to-end speech recognition in two ways: it optimizes the training algorithm for memory efficiency, and it proposes better model structures. The resulting 216 MB model achieves up to 11.8% relative WER reduction over the baseline RNN-T model and obtains WERs similar to those of a much larger server hybrid model.

In the last few years, an emerging trend in automatic speech recognition research has been the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among these three, RNN-T has the advantage of supporting online streaming, which is challenging for AED, and it does not make CTC's frame-independence assumption. In this paper, we improve RNN-T training in two respects. First, we optimize the training algorithm of RNN-T to reduce memory consumption, so that we can use a larger training minibatch for faster training. Second, we propose better model structures, so that we obtain RNN-T models with very good accuracy but a small footprint. Trained with 30 thousand hours of anonymized and transcribed Microsoft production data, the best RNN-T model, with an even smaller model size (216 Megabytes), achieves up to 11.8% relative word error rate (WER) reduction from the baseline RNN-T model. This best RNN-T model is significantly better than the device hybrid model of similar size, achieving up to 15.0% relative WER reduction, and obtains WERs similar to those of the server hybrid model, which is 5120 Megabytes in size.
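
The abstract reports its gains as relative WER reductions. As a minimal sketch of how such a figure is typically computed, assuming the usual definition (baseline WER minus new WER, divided by baseline WER), the snippet below works one example. The function name `relative_wer_reduction` and the two WER values are hypothetical and chosen only for illustration; the abstract does not state absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer versus baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent) for illustration only; these are NOT from the paper.
baseline = 10.0
improved = 8.82

reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.1f}% relative WER reduction")
```

Note that a relative reduction depends on the baseline: the same absolute drop in WER gives a larger relative reduction when the baseline WER is lower.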
