Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition
This paper addresses performance issues in end-to-end automatic speech recognition systems and is relevant to both researchers and practitioners. The contribution is incremental: it builds on existing attention-based models with a minor modification.
The paper tackles overconfidence in attention-based encoder-decoder models for automatic speech recognition by introducing relaxed attention, a simple method that injects a uniform distribution into the encoder-decoder attention weights during training. On WSJ, this approach sets a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, a 13.1% relative improvement over the previous state of the art (4.20%).
Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention, which is a simple gradual injection of a uniform distribution into the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We find that transformers trained with relaxed attention consistently outperform the standard baseline models during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the state of the art (4.20%) by 13.1% relative, while introducing only a single hyperparameter.
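
The injection described above can be illustrated with a minimal NumPy sketch. It assumes the injection is a convex combination of the softmax attention weights and a uniform distribution over the encoder frames, controlled by one mixing coefficient. The function name `relax_attention`, the parameter name `gamma`, and the toy scores are hypothetical and chosen for illustration only; the sketch uses a fixed coefficient and does not handle padded frames, so it should not be read as the authors' implementation.

```python
import numpy as np


def relax_attention(weights: np.ndarray, gamma: float, training: bool = True) -> np.ndarray:
    """Blend attention weights with a uniform distribution over the last axis.

    Assumption: each row of `weights` already sums to 1 (softmax output) and
    every position on the last axis is a valid, unpadded encoder frame.
    """
    if not training:
        # Relaxation is applied during training only; inference is unchanged.
        return weights
    num_frames = weights.shape[-1]
    # The two lines that do the work: build the uniform weight and mix it in.
    uniform = 1.0 / num_frames
    return (1.0 - gamma) * weights + gamma * uniform


# Toy example: two decoder steps attending over three encoder frames.
scores = np.array([[4.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
attention = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

relaxed = relax_attention(attention, gamma=0.25)
print(attention.round(3))
print(relaxed.round(3))
```

Because both terms are probability distributions, each relaxed row still sums to 1, while the largest weight in a peaked row shrinks toward 1/T for T encoder frames. Setting `gamma` to 0 recovers standard attention, which is consistent with the method adding only a single hyperparameter.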