CL AI Sep 30, 2022

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Chendong Zhao, Jianzong Wang, Wen qi Wei, Xiaoyang Qu, Haoqian Wang, Jing Xiao

arXiv:2209.15176v1 0.83 citations h-index: 22

Originality: Incremental advance

AI Analysis

This work addresses limitations in Transformer-based ASR for online applications, representing an incremental advancement in attention mechanisms for speech recognition.

The paper tackles the difficulty of applying Transformer self-attention and multi-head attention to streaming ASR by integrating sparse and monotonic attention mechanisms. The authors report improvements on widely used speech recognition benchmarks.

The Transformer architecture, built on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax-based attention mechanism assigns nonzero weight to every frame, which makes it hard to highlight important speech information. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments in different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme so that each self-attention structure better fits its corresponding head. The monotonic attention uses regularization to prune redundant heads in the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
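To make the contrast between dense and sparse attention weights concrete, the sketch below compares softmax with sparsemax on a small vector of attention scores. Sparsemax is a fixed, well-known sparse alternative to softmax and is used here only as an illustration. It is not the learned sparsity scheme described in the abstract, whose details are not given here. The function names and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Dense normalization: every score gets a strictly positive weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparse normalization: Euclidean projection onto the probability simplex.

    Illustrative stand-in only; the paper's learned sparsity scheme
    is not specified in the abstract and may differ.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The support is the largest k with 1 + k * z_k > sum of the top-k scores.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    # Scores at or below the threshold tau receive exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical attention scores for three speech frames.
scores = [2.0, 1.0, 0.1]
dense = softmax(scores)      # all three weights are nonzero
sparse = sparsemax(scores)   # [1.0, 0.0, 0.0]
```

With softmax, low-scoring frames still receive some weight, whereas sparsemax drops them entirely, which is one way a sparse mechanism can concentrate attention on a few frames.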
