CL · Feb 18, 2021

Echo State Speech Recognition

Harsh Shrivastava, Ankush Garg, Yuan Cao, Yu Zhang, Tara Sainath

arXiv:2102.09114v12.223 citations

Originality: Incremental advance

AI Analysis

This work addresses training efficiency for ASR practitioners, but it is incremental: it applies existing echo state network concepts to specific models (RNN-T and Conformer).

The authors tackled the problem of training efficiency in automatic speech recognition (ASR) by proposing models with randomly initialized and untrained decoder layers, showing that model quality does not drop even with fully randomized decoders, enabling more efficient training and storage.

We propose automatic speech recognition (ASR) models inspired by the echo state network (ESN), in which a subset of the recurrent neural network (RNN) layers are randomly initialized and left untrained. Our study focuses on RNN-T and Conformer models, and we show that model quality does not drop even when the decoder is fully randomized. Furthermore, such models can be trained more efficiently because the decoders do not need to be updated. By contrast, randomizing encoders hurts model quality, indicating that optimizing encoders to learn proper representations of acoustic inputs is more vital for speech recognition. Overall, we challenge the common practice of training all components of ASR models, and demonstrate that ESN-based models can perform equally well while enabling more efficient training and storage than their fully trainable counterparts.
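The echo state idea behind the abstract can be illustrated with a toy sketch: a recurrent layer is randomly initialized and never updated, and only a small linear readout is trained on top of it. This is not the RNN-T or Conformer setup from the paper; the function names, dimensions, and toy data are all hypothetical. Regenerating the frozen weights from a seed is likewise an assumption, made here to show one way untrained layers could reduce storage.

```python
import math
import random


def make_frozen_rnn(input_dim, hidden_dim, seed, scale=0.5):
    """Draw random recurrent weights from a seed. They are never trained."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-scale, scale) for _ in range(input_dim)]
            for _ in range(hidden_dim)]
    # Recurrent weights are scaled down so the hidden state stays bounded.
    w_rec = [[rng.uniform(-scale, scale) / hidden_dim for _ in range(hidden_dim)]
             for _ in range(hidden_dim)]
    return w_in, w_rec


def final_state(weights, sequence):
    """Run the frozen RNN over a sequence and return its last hidden state."""
    w_in, w_rec = weights
    hidden = [0.0] * len(w_in)
    for x in sequence:
        hidden = [
            math.tanh(sum(w * xi for w, xi in zip(w_in[j], x))
                      + sum(w * h for w, h in zip(w_rec[j], hidden)))
            for j in range(len(w_in))
        ]
    return hidden


def mse(readout, states, targets):
    """Mean squared error of the linear readout on the cached states."""
    errors = [sum(r * h for r, h in zip(readout, s)) - y
              for s, y in zip(states, targets)]
    return sum(e * e for e in errors) / len(errors)


def train_readout(states, targets, lr=0.1, steps=200):
    """Full-batch gradient descent on the readout only."""
    readout = [0.0] * len(states[0])
    n = len(states)
    for _ in range(steps):
        errors = [sum(r * h for r, h in zip(readout, s)) - y
                  for s, y in zip(states, targets)]
        readout = [r - lr * sum(e * s[j] for e, s in zip(errors, states)) / n
                   for j, r in enumerate(readout)]
    return readout


SEED = 0  # hypothetical: the only thing stored for the frozen layer
frozen = make_frozen_rnn(input_dim=2, hidden_dim=8, seed=SEED)

# Two toy input sequences with opposite targets (made-up data).
sequences = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
targets = [1.0, -1.0]

# Because the RNN is frozen, its outputs can be computed once and cached.
states = [final_state(frozen, seq) for seq in sequences]

loss_before = mse([0.0] * 8, states, targets)
readout = train_readout(states, targets)
loss_after = mse(readout, states, targets)
print(loss_before, loss_after)
```

In this sketch the training loop never touches `frozen`, so no gradients or optimizer state are kept for the recurrent layer, and the same weights can be rebuilt at any time by calling `make_frozen_rnn` with the same seed.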
