CL · Oct 28, 2017

A Study of All-Convolutional Encoders for Connectionist Temporal Classification

arXiv:1710.10398v2 · 11 citations
Originality: Incremental advance
AI Analysis

This work addresses the computational cost of training and decoding speech recognition models, a concern for both researchers and practitioners. It is incremental: it adapts existing methods and does not surpass RNN performance.

The study investigated replacing recurrent neural networks (RNNs) with convolutional neural networks (CNNs) as encoders in connectionist temporal classification for automatic speech recognition. It found that CNN-based models came close to LSTM performance without matching it, while being much faster to train and decode.

Connectionist temporal classification (CTC) is a popular sequence prediction approach for automatic speech recognition that is typically used with models based on recurrent neural networks (RNNs). We explore whether deep convolutional neural networks (CNNs) can be used effectively instead of RNNs as the "encoder" in CTC. CNNs lack an explicit representation of the entire sequence, but have the advantage that they are much faster to train. We present an exploration of CNNs as encoders for CTC models, in the context of character-based (lexicon-free) automatic speech recognition. In particular, we explore a range of one-dimensional convolutional layers, which are particularly efficient. We compare the performance of our CNN-based models against typical RNN-based models in terms of training time, decoding time, model size and word error rate (WER) on the Switchboard Eval2000 corpus. We find that our CNN-based models are close in performance to LSTMs, while not matching them, and are much faster to train and decode.
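
The remark that CNNs lack an explicit representation of the entire sequence can be made concrete with a receptive-field calculation: each output frame of a stack of one-dimensional convolutions sees only a bounded window of input frames, and that window grows with depth. The sketch below uses the standard receptive-field recurrence. The function name and the layer configuration are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def receptive_field(layers):
    """Return the receptive field, in input frames, of stacked 1-D convolutions.

    `layers` is a list of (kernel_size, stride, dilation) tuples, first layer first.
    """
    rf = 1    # with no layers, one output frame sees exactly one input frame
    jump = 1  # spacing, in input frames, between adjacent outputs of the current layer
    for kernel_size, stride, dilation in layers:
        # Each layer widens the window by (kernel_size - 1) taps,
        # scaled by its dilation and by the accumulated stride.
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical encoder (an assumption, not the paper's architecture):
# ten layers, kernel size 5, stride 1, no dilation.
hypothetical_stack = [(5, 1, 1)] * 10
print(receptive_field(hypothetical_stack))  # 41
```

With unit stride and no dilation the window grows only linearly in depth, so a convolutional encoder needs many layers, larger kernels, striding, or dilation before a single output frame covers a long stretch of audio, whereas a bidirectional RNN can in principle condition on the whole utterance.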
