AS CL

May 28, 2020

On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

arXiv:2005.14327v2

Originality: Synthesis-oriented

AI Analysis

This paper provides a comprehensive empirical benchmark for practitioners choosing among end-to-end models for large-scale speech recognition, though its contribution is incremental because it compares existing methods rather than proposing a new one.

The paper compared three end-to-end speech recognition models (RNN-T, RNN-AED, and Transformer-AED) using 65,000 hours of training data, finding that Transformer-AED achieved the best accuracy in both streaming and non-streaming modes, and that streaming RNN-T and Transformer-AED outperformed a highly-optimized hybrid model.

Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As E2E models are more data hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in the streaming mode if its encoder can be properly initialized. Among all three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly-optimized hybrid model.
