CL, LG, NE, ML · Aug 5, 2015

Listen, Attend and Spell

arXiv:1508.01211v2 · 2436 citations
AI Analysis

This paper addresses speech transcription for applications such as voice search. It offers an end-to-end approach that is competitive with, but does not surpass, state-of-the-art HMM-based models.

The paper introduces Listen, Attend and Spell (LAS), a neural network that transcribes speech directly to characters without making independence assumptions between them. On a subset of the Google voice search task, it achieves a word error rate of 14.1% without a dictionary or language model, and 10.3% with language model rescoring.
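For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (name is an assumption, not from the paper).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(word_error_rate("call mom on mobile", "call mom mobile"))
```

Under this definition, moving from 14.1% to 10.3% WER corresponds to roughly a 27% relative reduction in errors ((14.1 - 10.3) / 14.1).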

We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
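The two components described in the abstract can be illustrated with a heavily simplified sketch. It shows only the two structural ideas: a pyramid step that concatenates pairs of consecutive frames so each layer halves the time resolution, and an attention step that forms a context vector as a softmax-weighted sum of listener states. Everything here is an assumption made for illustration: the names `pyramid_reduce` and `attend` are hypothetical, the recurrent (LSTM) layers are omitted entirely, and a plain dot product stands in for the learned scoring function.

```python
import math
from typing import List

Vector = List[float]


def pyramid_reduce(frames: List[Vector]) -> List[Vector]:
    """Concatenate each pair of consecutive frames, halving the length.

    A real listener would run a recurrent layer over the result;
    that part is left out of this sketch.
    """
    usable = len(frames) - len(frames) % 2  # drop a trailing odd frame
    return [frames[t] + frames[t + 1] for t in range(0, usable, 2)]


def attend(decoder_state: Vector, listener_states: List[Vector]) -> Vector:
    """Softmax-weighted sum of listener states (dot-product scoring).

    Assumes decoder_state and each listener state share a dimension.
    """
    scores = [
        sum(s * h for s, h in zip(decoder_state, state))
        for state in listener_states
    ]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(listener_states[0])
    return [
        sum(w * state[d] for w, state in zip(weights, listener_states))
        for d in range(dim)
    ]


# Eight toy 2-dimensional frames standing in for filter bank features.
frames = [[float(t), 1.0] for t in range(8)]
reduced = pyramid_reduce(pyramid_reduce(pyramid_reduce(frames)))
print(len(reduced), len(reduced[0]))
```

Stacking three such reductions shortens the sequence eightfold, which leaves the attention step far fewer positions to score at each output character.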

Code Implementations: 40 repos

Data from Papers with Code (CC-BY-SA-4.0)

