CL | LG | SD | AS | Oct 26, 2022

Monotonic segmental attention for automatic speech recognition

arXiv:2210.14742v1 | 11 citations | h-index: 104
Originality: Incremental advance
AI Analysis

This work addresses efficiency and generalization issues in ASR for streaming applications, representing an incremental improvement over existing monotonic attention approaches.

The paper tackles the quadratic runtime of global attention in automatic speech recognition by introducing a segmental-attention model, which outperforms global attention and generalizes much better to long sequences of up to several minutes.

We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, one specifically taking the segmental nature into account, yielding further improvements. Using time-synchronous decoding for segmental models is novel and a step towards streaming applications. Our experiments show the importance of a length model to predict the segment boundaries. The final best segmental-attention model using segmental decoding performs better than global-attention, in contrast to other monotonic attention approaches in the literature. Further, we observe that the segmental model generalizes much better to long sequences of up to several minutes.
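To make the contrast between global and segment-restricted attention concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the plain dot-product scoring, and the assumption that segments are contiguous, non-empty, and given in advance as end indices (`boundaries`) are all illustrative choices. In the paper the boundaries would come from a length model during decoding rather than being supplied up front.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def global_attention(queries, encoder):
    """Every decoder query scores all T encoder frames: S * T scores in total."""
    contexts = []
    for q in queries:
        weights = softmax(encoder @ q)      # shape (T,)
        contexts.append(weights @ encoder)  # weighted sum of frames, shape (D,)
    return np.stack(contexts)


def segmental_attention(queries, encoder, boundaries):
    """Query s scores only the frames of its own segment.

    boundaries[s] is the (exclusive) end frame of segment s; segment s starts
    where segment s-1 ended. Assumption: segments are non-empty and
    monotonically increasing, so the total number of scores is at most T.
    """
    contexts = []
    start = 0
    for q, end in zip(queries, boundaries):
        segment = encoder[start:end]        # shape (end - start, D)
        weights = softmax(segment @ q)
        contexts.append(weights @ segment)
        start = end                         # monotonic: never look back
    return np.stack(contexts)
```

With S output labels and T encoder frames, the global variant computes S * T attention scores, which grows quadratically when S scales with T, whereas the segmental variant computes one score per frame. Because each context vector depends only on frames up to its segment end, this structure is also what makes streaming decoding plausible.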

Code Implementations: 1 repo
