Towards better decoding and language model integration in sequence to sequence models
The work addresses practical issues that users of seq2seq speech recognition models encounter, but it is incremental because it builds on existing frameworks.
The paper tackles two problems in attention-based seq2seq speech recognition systems: overconfident predictions, and incomplete transcriptions when a language model is used. It achieves competitive speaker-independent word error rates on the Wall Street Journal dataset: 10.6% without a language model and 6.7% with a trigram language model.
The recently proposed Sequence-to-Sequence (seq2seq) framework advocates replacing complex data processing pipelines, such as an entire automatic speech recognition system, with a single neural network trained in an end-to-end fashion. In this contribution, we analyse an attention-based seq2seq speech recognition system that directly transcribes recordings into characters. We observe two shortcomings: overconfidence in its predictions and a tendency to produce incomplete transcriptions when language models are used. We propose practical solutions to both problems, achieving competitive speaker-independent word error rates on the Wall Street Journal dataset: without separate language models we reach 10.6% WER, while together with a trigram language model we reach 6.7% WER.
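For readers unfamiliar with the metric, the WER figures quoted above are word error rates: the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in the paper; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Illustrative helper (hypothetical name); not the paper's evaluation script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# A truncated (incomplete) transcription is penalized through deletions:
# two of the six reference words are missing.
wer = word_error_rate("the cat sat on the mat", "the cat sat on")
print(f"{wer:.3f}")
```

Under this definition an incomplete transcription raises WER only through deletions, which is why truncated outputs show up directly in the reported score.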