CL LG SD AS Nov 13, 2018

An Online Attention-based Model for Speech Recognition

Ruchao Fan, Pan Zhou, Wei Chen, Jia Jia, Gang Liu

arXiv:1811.05247v25.848 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for real-time speech recognition systems, which are crucial for applications like voice assistants and live transcription. It is an incremental advance, since it builds on existing LAS models.

The authors tackled the problem of making attention-based end-to-end speech recognition models suitable for real-time applications by addressing the latency issues of bidirectional encoders and global soft attention. They proposed an online model, LC-AMoChA, which achieved only a 3.5% relative performance reduction compared to the baseline on a Mandarin corpus.

Attention-based end-to-end models such as Listen, Attend and Spell (LAS) simplify the whole pipeline of traditional automatic speech recognition (ASR) systems and have become popular in the field of speech recognition. Previous work has shown that such architectures can achieve results comparable to state-of-the-art ASR systems, especially when using a bidirectional encoder and a global soft attention (GSA) mechanism. However, the bidirectional encoder and GSA are two obstacles to real-time speech recognition. In this work, we aim to make the LAS baseline streamable by removing these two obstacles. On the encoder side, we use a latency-controlled (LC) bidirectional structure to reduce the delay of forward computation. Meanwhile, an adaptive monotonic chunk-wise attention (AMoChA) mechanism is proposed to replace GSA for the calculation of the attention weight distribution. Furthermore, we propose two methods to alleviate the large performance degradation that occurs when combining LC and AMoChA. Finally, we obtain an online LAS model, LC-AMoChA, which shows only a 3.5% relative performance reduction compared to the LAS baseline on our internal Mandarin corpus.
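To make the contrast with global soft attention concrete, the following is a minimal sketch of one inference step of hard monotonic chunk-wise attention with a fixed chunk width. It is not the authors' AMoChA: the abstract does not describe how the chunk is adapted, so the width here is a plain argument. The function name `chunkwise_attend`, the scalar energy inputs, and the 0.5 selection threshold are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def chunkwise_attend(select_energies, chunk_energies, start, width):
    """One decoder step of hard monotonic chunk-wise attention (inference).

    select_energies[j]: hypothetical scalar energy for stopping at frame j.
    chunk_energies[j]:  hypothetical scalar energy for weighting frame j.
    start: frame index where the previous decoder step stopped.
    width: fixed chunk width (assumption; an adaptive variant would
           choose this per step instead of taking it as an argument).

    Returns (boundary, weights), where weights maps frame index -> weight,
    or (None, {}) if no frame is selected.
    """
    # Scan left to right from the previous boundary; never look back.
    for j in range(start, len(select_energies)):
        if sigmoid(select_energies[j]) >= 0.5:
            # Soft attention only over the small window ending at frame j.
            lo = max(0, j - width + 1)
            probs = softmax(chunk_energies[lo:j + 1])
            return j, dict(zip(range(lo, j + 1), probs))
    return None, {}
```

Because each step only reads frames up to the selected boundary, the decoder does not need the full utterance before emitting output, which is the property that global soft attention lacks.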
