AS CL Jan 20, 2020

Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

arXiv:2001.07263v3 17.670 citations

Originality: Incremental advance

AI Analysis

This paper addresses speech recognition for applications with a moderate amount of training data, showing incremental improvements through regularization and model scaling.

The paper tackles state-of-the-art speech recognition on the Switchboard database using a single-headed attention LSTM model. Trained on Switchboard-300, the system reaches word error rates of 6.4% and 12.5% on the Switchboard and CallHome subsets of Hub5'00; trained on SWB-2000, it reaches 4.7% and 7.8%.

It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00, without a pronunciation lexicon. While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data. Overall, the combination of various regularizations and a simple but fairly large model results in a new state of the art, 4.7% and 7.8% WER on the Switchboard and CallHome sets, using SWB-2000 without any external data resources.
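For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the toy sentences are illustrative assumptions, not taken from the paper, and this is not the scoring tool the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; real evaluations normally use
    a dedicated scoring tool with text normalization rules.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Toy example (made-up sentences): one deleted word out of six.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.3f}")
```

Corpus-level figures such as those in the abstract are typically obtained by summing errors and reference words over all test utterances before dividing, rather than by averaging per-utterance rates.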
