AS, CL | May 18, 2020

Weak-Attention Suppression For Transformer Based Speech Recognition

arXiv:2005.09137v1 | 21 citations
Originality: Incremental advance
AI Analysis

This work improves speech recognition accuracy for streaming applications by adapting transformer attention mechanisms to handle acoustic data, representing an incremental advance in domain-specific ASR models.

The paper addresses a mismatch that arises when transformers are applied to automatic speech recognition (ASR): adjacent acoustic frames are highly correlated and long-distance dependencies between them are weak, unlike text units. It proposes Weak-Attention Suppression (WAS) to induce sparsity in attention, which reduced Word Error Rate (WER) by 10% on test-clean and 5% on test-other for streaming models on LibriSpeech.

Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR). However, adjacent acoustic units (i.e., frames) are highly correlated, and long-distance dependencies between them are weak, unlike text units. It suggests that ASR will likely benefit from sparse and localized attention. In this paper, we propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities. We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines. On the widely used LibriSpeech benchmark, our proposed method reduced WER by 10% on test-clean and 5% on test-other for streamable transformers, resulting in a new state-of-the-art among streaming models. Further analysis shows that WAS learns to suppress attention of non-critical and redundant continuous acoustic frames, and is more likely to suppress past frames rather than future ones. It indicates the importance of lookahead in attention-based ASR models.
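The abstract describes WAS only as dynamically inducing sparsity in attention probabilities and does not give the suppression rule. The sketch below shows one way such a mechanism could look, and should not be read as the authors' implementation. It assumes a per-query threshold of mean minus `gamma` times the standard deviation of the attention probabilities. Keys falling below the threshold are masked and the softmax is recomputed over the survivors. The function names, the `gamma` parameter, and the threshold form are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries of -inf receive probability 0."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def weak_attention_suppression(scores, gamma=0.5):
    """Sparsify attention for one head (hypothetical helper).

    scores: array of shape (num_queries, num_keys) holding raw attention
            logits (scaled dot products).
    gamma:  assumed hyperparameter; larger values suppress fewer keys.
    Returns the renormalized attention probabilities and the boolean keep mask.
    """
    probs = softmax(scores, axis=-1)

    # Assumed rule: a separate threshold for every query row.
    mean = probs.mean(axis=-1, keepdims=True)
    std = probs.std(axis=-1, keepdims=True)
    threshold = mean - gamma * std

    # The largest probability in a row is never below the row mean, so for
    # gamma >= 0 at least one key always survives.
    keep = probs >= threshold

    # Mask weak keys at the logit level, then renormalize with a second softmax.
    masked_scores = np.where(keep, scores, -np.inf)
    return softmax(masked_scores, axis=-1), keep


if __name__ == "__main__":
    # One query attending over five frames: two strong frames, three weak ones.
    scores = np.array([[4.0, 0.0, 0.0, 0.0, 3.5]])
    sparse_probs, keep = weak_attention_suppression(scores, gamma=0.5)
    print(keep)
    print(sparse_probs)
```

Because the threshold is recomputed from each query's own distribution, the number of suppressed frames varies from query to query, which is one reading of "dynamically" in the abstract.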

