AS, LG, SD | Apr 7, 2021

Capturing Multi-Resolution Context by Dilated Self-Attention

arXiv:2104.02858v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the computational bottleneck of self-attention on long inputs, such as those in ASR. It is an incremental advance that combines two existing techniques, restricted self-attention and a dilation mechanism, to reduce cost.

The paper tackles the quadratic computational complexity of self-attention on long sequences, such as those in automatic speech recognition (ASR), by proposing dilated self-attention, which combines restricted self-attention with a dilation mechanism to capture multi-resolution context. In ASR experiments, the method substantially improves on restricted self-attention alone and achieves results similar to full-sequence self-attention at a fraction of the computational cost.
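To make the "fraction of the computational cost" claim concrete, the sketch below counts query-key score evaluations per layer for three attention patterns. It is an illustration only: the function name, the `window` and `chunk` parameters, and the example sizes are assumptions and do not come from the paper. The cost of computing the summary vectors themselves is ignored.

```python
import math

def attention_score_counts(n_frames, window, chunk):
    """Count query-key dot products for three attention patterns.

    n_frames: sequence length N
    window:   frames visible on each side of the query (hypothetical parameter)
    chunk:    frames summarized into one vector (hypothetical parameter)
    """
    # Full self-attention: every query attends to every frame.
    full = n_frames * n_frames
    # Restricted: each query sees at most 2*window + 1 frames, clipped at the edges.
    restricted = sum(
        min(n_frames, t + window + 1) - max(0, t - window)
        for t in range(n_frames)
    )
    # Dilated: the restricted keys plus one summary vector per chunk of frames.
    n_summaries = math.ceil(n_frames / chunk)
    dilated = restricted + n_frames * n_summaries
    return {"full": full, "restricted": restricted, "dilated": dilated}

# Illustrative sizes only, not taken from the paper's experiments.
counts = attention_score_counts(n_frames=1000, window=15, chunk=20)
ratio = counts["dilated"] / counts["full"]
```

With these assumed sizes, the dilated pattern evaluates roughly 8% of the scores that full self-attention does. The summary term still grows as N²/chunk, so the saving depends on how aggressively distant frames are summarized.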

Self-attention has become an important and widely used neural network component that helped to establish new state-of-the-art results for various applications, such as machine translation and automatic speech recognition (ASR). However, the computational complexity of self-attention grows quadratically with the input sequence length. This can be particularly problematic for applications such as ASR, where an input sequence generated from an utterance can be relatively long. In this work, we propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention. The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution. Different methods for summarizing distant frames are studied, such as subsampling, mean-pooling, and attention-based pooling. ASR results demonstrate substantial improvements compared to restricted self-attention alone, achieving similar results compared to full-sequence based self-attention with a fraction of the computational costs.
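The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a single head with no learned projections, and mean-pooling over fixed-size chunks as the summarization method (the abstract also mentions subsampling and attention-based pooling). The names `dilated_self_attention`, `mean_pool_chunks`, `window`, and `chunk` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mean_pool_chunks(x, chunk):
    """Summarize each run of `chunk` frames by its mean (last chunk may be shorter)."""
    return np.stack([x[i:i + chunk].mean(axis=0) for i in range(0, len(x), chunk)])

def dilated_self_attention(x, window, chunk):
    """Attend to nearby frames at full resolution and to chunk summaries at low resolution.

    x: array of shape (n_frames, dim). Assumption: queries, keys, and values are
    the raw inputs (no learned projections, single head).
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    summaries = mean_pool_chunks(x, chunk)  # low-resolution view of the whole sequence
    out = np.empty((n, d))
    for t in range(n):
        lo, hi = max(0, t - window), min(n, t + window + 1)
        # Keys/values: neighboring frames of the query, followed by the summaries.
        kv = np.concatenate([x[lo:hi], summaries], axis=0)
        weights = softmax(kv @ x[t] / np.sqrt(d))
        out[t] = weights @ kv
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 8))
context = dilated_self_attention(frames, window=3, chunk=10)
```

Each query here scores at most 2 * window + 1 neighboring frames plus one summary per chunk, rather than all frames. Swapping `mean_pool_chunks` for a different summarizer changes only how the low-resolution keys and values are built.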

