CL, Apr 4, 2019

Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

arXiv:1904.02619v1, 108 citations
Originality: Highly original
AI Analysis

This paper addresses the efficiency and accuracy of speech recognition, including on noisy audio, and represents a strong incremental improvement over existing sequence-to-sequence methods.

The authors propose a fully convolutional sequence-to-sequence model built on time-depth separable convolutions. Coupled with a convolutional language model, it improves word error rate (WER) by more than 22% relative over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set, while being an order of magnitude more efficient than a strong RNN baseline.
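To make "relative WER improvement" concrete, here is a minimal sketch of the arithmetic. The function name and the numbers in the usage line are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer improves on baseline_wer.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (NOT taken from the paper): a drop from 10.0 to 7.8
# is a 2.2-point absolute reduction and a 22% relative reduction.
print(f"{relative_wer_reduction(10.0, 7.8):.1%}")
```

A relative figure depends on the baseline: the same 2.2-point absolute drop would be a smaller relative gain against a weaker baseline with a higher WER.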

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block which dramatically reduces the number of parameters in the model while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows us to effectively integrate a language model. Coupled with a convolutional language model, our time-depth separable convolution architecture improves by more than 22% relative WER over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.
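The parameter saving claimed for the time-depth separable block can be illustrated with a rough weight count. This is a sketch under stated assumptions, not the authors' implementation: it assumes the input is viewed as T x w x c (time, feature width, channels), that the time convolution uses a k x 1 kernel mixing only the c channels, and that mixing across all w*c features is left to two pointwise linear layers. The function names and the sizes w=80, c=10, k=21 are hypothetical.

```python
def full_conv1d_params(w: int, c: int, k: int) -> int:
    """Weights of a standard 1D time convolution that mixes all w*c
    features at every one of its k taps (biases ignored)."""
    return k * (w * c) ** 2


def tds_time_conv_params(c: int, k: int) -> int:
    """Weights of a k x 1 convolution over a (T, w, c) input with c input
    and c output channels; the count does not depend on w (assumption)."""
    return k * c * c


def pointwise_params(w: int, c: int, layers: int = 2) -> int:
    """Weights of `layers` pointwise linear maps over the flattened w*c
    features; the count does not depend on the kernel width k (assumption)."""
    return layers * (w * c) ** 2


def receptive_field(k: int, num_layers: int) -> int:
    """Receptive field in frames of num_layers stacked stride-1
    convolutions with kernel width k."""
    return 1 + num_layers * (k - 1)


# Hypothetical sizes, chosen for illustration only.
w, c, k = 80, 10, 21
standard = full_conv1d_params(w, c, k)
separable = tds_time_conv_params(c, k) + pointwise_params(w, c)
print(standard, separable, round(standard / separable, 1))
print(receptive_field(k, num_layers=3))
```

Under these assumptions the receptive field depends only on k and the depth, so it is the same for both variants, while the cost of widening the kernel grows with c squared rather than with (w*c) squared.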

