AS, LG, SD · Mar 31, 2019

Learning Shared Encoding Representation for End-to-End Speech Recognition Models

arXiv:1904.02147v1 · 5 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of training efficient end-to-end speech recognition models, offering incremental improvements for the speech processing community.

The paper tackles the complexity of optimizing CTC models for end-to-end speech recognition by learning a shared encoding representation through multi-task training, which yields a significant improvement over plain-task training. It also uses this representation to initialize the encoder of attention-based models, achieving word error rates of 12.2% and 22.6% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation, respectively.

In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that multi-task training not only addresses the difficulty of optimizing CTC models such as acoustic-to-word models but also yields a significant improvement over plain-task training with an optimal setup. Furthermore, we propose using the encoding representation learned by the multi-task network to initialize the encoder of attention-based models. In this way, we train a deep attention-based end-to-end model with a 10-layer long short-term memory (LSTM) encoder, which achieves word error rates of 12.2% and 22.6% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation, respectively.
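
To make the multi-task objective concrete, here is a minimal sketch of a combined CTC and framewise cross-entropy loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the two criteria are weighted, so the weighted sum and the `ce_weight` parameter are assumptions. All function names are hypothetical. In the real model both sets of logits would come from two output layers on top of the same shared encoder; here they are passed in directly.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_softmax(row):
    """Turn one frame of logits into log-probabilities."""
    z = logsumexp(*row)
    return [x - z for x in row]

def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    log_probs: T x V per-frame log-probabilities; target: label ids (no blanks).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank in between
            new[s] = logsumexp(*terms) + frame[ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    tail = [alpha[-1]] + ([alpha[-2]] if S > 1 else [])
    return -logsumexp(*tail)

def framewise_ce(log_probs, frame_labels):
    """Mean cross-entropy against one aligned label per frame."""
    total = sum(lp[y] for lp, y in zip(log_probs, frame_labels))
    return -total / len(frame_labels)

def multitask_loss(ctc_logits, ce_logits, target, frame_labels, ce_weight=0.5):
    """Assumed combination: CTC loss + ce_weight * framewise CE loss."""
    ctc_lp = [log_softmax(r) for r in ctc_logits]
    ce_lp = [log_softmax(r) for r in ce_logits]
    return ctc_nll(ctc_lp, target) + ce_weight * framewise_ce(ce_lp, frame_labels)

# Toy usage: 2 frames; CTC head over {blank, one label}; CE head over 3 frame-level classes.
ctc_logits = [[0.0, 0.0], [0.0, 0.0]]
ce_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = multitask_loss(ctc_logits, ce_logits, target=[1], frame_labels=[2, 0])
```

Because both losses are computed on outputs of the same encoder, gradients from the framewise criterion also shape the representation used by the CTC head, which is the sense in which the encoding is shared.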

