CL · LG · ML · May 8, 2018

Improved training of end-to-end attention models for speech recognition

arXiv:1805.03294v1 · 280 citations
Originality: Incremental advance
AI Analysis

This work improves speech recognition accuracy for applications such as transcription. The advance is incremental, although it introduces new training techniques.

The authors address the training of end-to-end attention models for speech recognition, reporting state-of-the-art word error rates of 3.54% on the dev-clean and 3.82% on the test-clean subsets of LibriSpeech. Shallow fusion with a language model gives up to 27% relative improvement in word error rate over the attention baseline without a language model.

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report the state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model.
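The shallow fusion mentioned in the abstract is commonly implemented as a log-linear combination of the attention model's and the language model's scores at each decoding step. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the toy probabilities, the WER figures, and the language model weight of 0.3 are all assumptions chosen for illustration.

```python
import math


def shallow_fusion_step(am_log_probs, lm_log_probs, lm_weight=0.3):
    """Fuse per-token scores for one decoding step.

    am_log_probs: log p(token | audio, history) from the attention model.
    lm_log_probs: log p(token | history) from the subword language model.
    lm_weight:    hypothetical interpolation weight, normally tuned on a dev set.
    """
    return {
        token: am_log_probs[token] + lm_weight * lm_log_probs[token]
        for token in am_log_probs
    }


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Toy example: the acoustic evidence slightly favours "their",
# but the language model strongly prefers "there" in this context.
am = {"there": math.log(0.45), "their": math.log(0.55)}
lm = {"there": math.log(0.90), "their": math.log(0.10)}

fused = shallow_fusion_step(am, lm)
best = max(fused, key=fused.get)
print("best token after fusion:", best)

# Toy WER values (not from the paper), used only to show the arithmetic.
print("relative improvement: %.1f%%" % relative_improvement(10.0, 8.0))
```

In a real decoder the fused scores would be used inside beam search over subword units, with the weight tuned on held-out data.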

Code Implementations: 14 repos

Data from Papers with Code (CC-BY-SA-4.0)
