SDAS Oct 29, 2018

An improved hybrid CTC-Attention model for speech recognition

arXiv:1810.12020v3
12 citations
Originality: Incremental advance
AI Analysis

This work addresses speech recognition accuracy for end-to-end ASR systems, representing an incremental improvement over existing hybrid models.

The paper aims to improve end-to-end speech recognition by proposing a novel CTC decoder structure and an attention smoothing mechanism. It reports a word error rate of 4.43% without a language model and 3.34% with an RNN-LM on the LibriSpeech test-clean subset, which the authors describe as the best reported results for end-to-end ASR systems on this dataset.
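For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; this is not the paper's scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, which real scoring tools may refine.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of 4.43% corresponds to roughly 4 to 5 word errors per 100 reference words.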

Recently, end-to-end speech recognition with a hybrid model consisting of connectionist temporal classification (CTC) and the attention encoder-decoder has achieved state-of-the-art results. In this paper, we propose a novel CTC decoder structure based on the experiments we conducted, and we explore the relation between decoding performance and the depth of the encoder. We also apply an attention smoothing mechanism to acquire more context information for subword-based decoding. Taken together, these strategies allow us to achieve a word error rate (WER) of 4.43% without an LM and 3.34% with an RNN-LM on the test-clean subset of the LibriSpeech corpus, which are by far the best reported WERs for end-to-end ASR systems on this dataset.
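The abstract does not say how attention smoothing is formulated. As a hedged illustration only, the sketch below assumes one common variant from the attention-based ASR literature, in which the exponential in the softmax is replaced by a logistic sigmoid before normalization. This flattens the attention distribution so that more encoder frames contribute to each decoding step. The function names and the example scores are hypothetical, and the paper's exact mechanism may differ.

```python
import math


def softmax_attention(scores):
    """Standard softmax over attention scores (numerically stabilized)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _sigmoid(x):
    """Logistic sigmoid written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def smoothed_attention(scores):
    """Assumed smoothing variant: normalize sigmoid(score) instead of exp(score)."""
    sig = [_sigmoid(s) for s in scores]
    total = sum(sig)
    return [s / total for s in sig]


if __name__ == "__main__":
    scores = [4.0, 1.0, 0.0, -1.0]  # hypothetical attention energies
    print("softmax :", [round(w, 3) for w in softmax_attention(scores)])
    print("smoothed:", [round(w, 3) for w in smoothed_attention(scores)])
```

With these example scores, the softmax places most of its weight on the first frame, while the smoothed weights are spread more evenly across neighbouring frames, which is the kind of wider context the abstract associates with subword-based decoding.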
