CL · AS · May 1, 2020

Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition

arXiv:2005.00572v1 · 27 citations
Originality: Incremental advance
AI Analysis

This work addresses training challenges for RNN-T models in speech recognition, offering incremental improvements over existing initialization strategies.

The authors tackle the difficulty of training RNN-T models for speech recognition by proposing pre-training methods that use external alignments. On a 65,000-hour dataset, the encoder pre-training variant achieves a 10% relative word error rate reduction compared to random initialization and an 8% reduction compared to CTC+RNNLM initialization.

Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research because it supports online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common solution to ease RNN-T training is to employ a connectionist temporal classification (CTC) model along with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we instead leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on 65,000 hours of anonymized Microsoft production data with personally identifiable information removed, our proposed methods obtain significant improvements. In particular, the encoder pre-training solution achieved a 10% and an 8% relative word error rate reduction when compared with random initialization and the widely used CTC+RNNLM initialization strategy, respectively. Our solutions also significantly reduce the RNN-T model latency relative to the baseline.
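
The abstract reports its gains as relative word error rate (WER) reductions. As a minimal sketch of how such a figure is computed, the snippet below uses the standard definition (baseline minus new, divided by baseline). The function name and the WER values are hypothetical and chosen only for illustration; the abstract does not state absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer versus baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT taken from the paper.
random_init_wer = 10.0   # assumed baseline with random initialization
pretrained_wer = 9.0     # assumed result after encoder pre-training

print(relative_wer_reduction(random_init_wer, pretrained_wer))  # 10.0
```

Under these assumed values, a drop from 10.0% to 9.0% WER is a 1.0-point absolute reduction but a 10% relative reduction, which is the kind of figure the abstract quotes.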
