AS CL LG SD
Jul 27, 2020

Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition

Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas

arXiv:2007.13802v1
17.657 citations
h-index: 66

Originality: Incremental advance

AI Analysis

This work addresses efficiency and accuracy issues in end-to-end speech recognition for applications like far-field recordings and music-domain utterances, though it is incremental as it builds on existing MWER training methods.

The paper tackles slow minimum word error rate (MWER) training for RNN-Transducer (RNN-T) speech recognition. Instead of generating alignment scores with on-the-fly beam search, the proposed method recomputes the score of each hypothesis in an N-best list by summing over all of its possible alignments with the forward-backward algorithm. This makes training 6 times faster than the on-the-fly method while yielding a similar 3.6% WER improvement over a baseline RNN-T model.

In this work, we propose a novel and efficient minimum word error rate (MWER) training method for RNN-Transducer (RNN-T). Unlike previous work on this topic, which performs on-the-fly limited-size beam-search decoding and generates alignment scores for expected edit-distance computation, in our proposed method, we re-calculate and sum scores of all the possible alignments for each hypothesis in N-best lists. The hypothesis probability scores and back-propagated gradients are calculated efficiently using the forward-backward algorithm. Moreover, the proposed method allows us to decouple the decoding and training processes, and thus we can perform offline parallel-decoding and MWER training for each subset iteratively. Experimental results show that this proposed semi-on-the-fly method can speed up the on-the-fly method by 6 times and result in a similar WER improvement (3.6%) over a baseline RNN-T model. The proposed MWER training can also effectively reduce high-deletion errors (9.2% WER-reduction) introduced by RNN-T models when EOS is added for endpointing. Further improvement can be achieved if we use a proposed RNN-T rescoring method to re-rank hypotheses and use an external RNN-LM to perform additional rescoring. The best system achieves a 5% relative improvement on an English test-set of real far-field recordings and an 11.6% WER reduction on music-domain utterances.
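The two ingredients named in the abstract, summing the scores of all alignments of a hypothesis and weighting word errors by hypothesis probability over an N-best list, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`rnnt_log_prob`, `word_errors`, `mwer_loss`), the array layout `(T, U+1, V)`, the blank index 0, and the mean-error baseline in the loss are all assumptions chosen for illustration. It covers only the forward recursion; the gradients mentioned in the abstract would additionally need the backward variables or an autodiff framework.

```python
import numpy as np


def rnnt_log_prob(log_probs, labels, blank=0):
    """Log-probability of `labels`, summed over all RNN-T alignments.

    log_probs: array of shape (T, U+1, V) holding log-softmax outputs of a
               joint network (assumed layout), T frames, U = len(labels).
    """
    T, U1, _ = log_probs.shape
    U = len(labels)
    assert U1 == U + 1, "second axis must have len(labels) + 1 entries"

    # alpha[t, u]: log-probability of having emitted the first u labels
    # by the time frame t is reached.
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at (t-1, u)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrive by emitting label u-1 at (t, u-1)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u],
                    alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    # Every alignment ends with a final blank from the last lattice node.
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]


def word_errors(hyp, ref):
    """Word-level Levenshtein distance between two whitespace-split strings."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                # extra word in hyp
                           cur[j - 1] + 1,             # missing word in hyp
                           prev[j - 1] + (hw != rw)))  # match / substitution
        prev = cur
    return prev[-1]


def mwer_loss(hyp_log_probs, errors):
    """Expected word errors over an N-best list.

    Probabilities are renormalized over the list, and the mean error is
    subtracted as a baseline (an assumed, commonly used variant).
    """
    lp = np.asarray(hyp_log_probs, dtype=float)
    w = np.asarray(errors, dtype=float)
    p_hat = np.exp(lp - np.logaddexp.reduce(lp))
    return float(np.sum(p_hat * (w - w.mean())))
```

Because `mwer_loss` only needs per-hypothesis log-probabilities and error counts, the N-best lists could be produced by a separate offline decoding pass, which is the decoupling of decoding and training that the abstract describes.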
