AS CL May 16, 2020

Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang

arXiv:2005.08042v118.364 citations, h-index: 39

Originality: Highly original

AI Analysis

This paper addresses the challenge of deploying Transformer models in real-time speech recognition systems. The analysis rates it as an incremental improvement over prior streaming methods.

The paper tackles the problem of making Transformer-based acoustic models suitable for streaming applications. It proposes an augmented memory self-attention mechanism that attends to a short segment of the input together with a memory bank holding embeddings of the segments already processed. On the Librispeech benchmark, the method outperforms existing streamable Transformer methods and achieves over 15% relative error reduction compared with the LC-BLSTM baseline.
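For readers unfamiliar with the metric, relative error reduction is conventionally computed as below. The use of word error rate (WER) as the error measure and the numbers in the worked example are illustrative assumptions, not figures taken from the paper.

```latex
\[
\text{relative reduction}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{proposed}}}{\mathrm{WER}_{\text{baseline}}},
\qquad
\text{e.g. } \frac{10.0 - 8.5}{10.0} = 0.15 = 15\%.
\]
```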

Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition. However, it requires access to the full sequence, and the computational cost grows quadratically with respect to the input sequence length. These factors limit its adoption for streaming applications. In this work, we proposed a novel augmented memory self-attention, which attends on a short segment of the input sequence and a bank of memories. The memory bank stores the embedding information for all the processed segments. On the librispeech benchmark, our proposed method outperforms all the existing streamable transformer methods by a large margin and achieved over 15% relative error reduction, compared with the widely used LC-BLSTM baseline. Our findings are also confirmed on some large internal datasets.
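The idea of attending to a short segment plus a memory bank can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function name `augmented_memory_attention`, the square projection matrices, and the choice to form each memory slot from a mean-pooled summary query are all assumptions made for the example, and details such as left/right context frames and multiple heads are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def augmented_memory_attention(frames, segment_len, w_q, w_k, w_v):
    """Process `frames` (T, d) segment by segment.

    w_q, w_k, w_v are (d, d) projections (hypothetical parameters).
    Returns per-frame outputs (T, d) and the memory bank (num_segments, d).
    """
    d = frames.shape[1]
    memory = np.zeros((0, d))  # one slot is appended per processed segment
    outputs = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        # Assumption: summarize the segment with a mean-pooled query.
        summary = segment.mean(axis=0, keepdims=True)
        queries = np.concatenate([segment, summary]) @ w_q
        # Keys and values cover the memory bank and the current segment only,
        # so the cost per segment does not depend on future frames.
        context = np.concatenate([memory, segment])
        keys, values = context @ w_k, context @ w_v
        attended = softmax(queries @ keys.T / np.sqrt(d)) @ values
        outputs.append(attended[:-1])                      # frame outputs
        memory = np.concatenate([memory, attended[-1:]])   # new memory slot
    return np.concatenate(outputs), memory
```

Because each segment sees only itself and the memory slots of earlier segments, the output for a frame is unaffected by anything that arrives in later segments, which is the property that makes the scheme usable for streaming.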
