AS, CL, LG, SD | Apr 1, 2021

Keyword Transformer: A Self-Attention Model for Keyword Spotting

arXiv:2104.00769v3 | 174 citations
AI Analysis

The Keyword Transformer offers a simpler, high-accuracy model for keyword spotting in speech recognition, though the contribution is incremental: it adapts an existing architecture, the Transformer, to a specific domain.

The paper addresses keyword spotting by introducing the Keyword Transformer (KWT), a fully self-attentional architecture that reaches state-of-the-art accuracy on the Google Speech Commands dataset: 98.6% on the 12-command task and 97.7% on the 35-command task.

The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12- and 35-command tasks respectively.
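To make "fully self-attentional" concrete, here is a minimal NumPy sketch of a classifier that feeds spectrogram frames straight into self-attention blocks, with no convolutional or recurrent encoder in front. It is an illustration under stated assumptions, not the authors' implementation. The input size (98 frames of 40 features), the learned class token, the pre-norm residual layout, the ReLU feed-forward layer, and all function and parameter names are choices made for this sketch. The weights are random, so the output only demonstrates shapes and data flow.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise the last axis (no learned scale/shift, to keep the sketch short).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, wo, num_heads):
    # Multi-head scaled dot-product attention over the token (time) axis.
    b, t, d = x.shape
    dh = d // num_heads

    def split(z):
        return z.reshape(b, t, num_heads, dh).transpose(0, 2, 1, 3)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(dh)   # (b, heads, t, t)
    out = softmax(scores) @ v                            # (b, heads, t, dh)
    return out.transpose(0, 2, 1, 3).reshape(b, t, d) @ wo

def init_params(rng, n_feats, n_frames, dim, depth, n_classes):
    # Hypothetical parameter layout; sizes are assumptions, not the paper's.
    def w(*shape):
        return rng.normal(0.0, 0.02, shape)

    blocks = [
        {"wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
         "wo": w(dim, dim), "w1": w(dim, 4 * dim), "w2": w(4 * dim, dim)}
        for _ in range(depth)
    ]
    return {"proj": w(n_feats, dim), "cls": w(1, 1, dim),
            "pos": w(1, n_frames + 1, dim), "head": w(dim, n_classes),
            "blocks": blocks}

def forward(params, spec, num_heads):
    # spec: (batch, n_frames, n_feats); each time frame becomes one token.
    b = spec.shape[0]
    x = spec @ params["proj"]
    cls = np.broadcast_to(params["cls"], (b, 1, x.shape[-1]))
    x = np.concatenate([cls, x], axis=1) + params["pos"]
    for blk in params["blocks"]:
        # Pre-norm residual blocks (an assumption of this sketch).
        x = x + self_attention(layer_norm(x), blk["wq"], blk["wk"],
                               blk["wv"], blk["wo"], num_heads)
        x = x + np.maximum(layer_norm(x) @ blk["w1"], 0.0) @ blk["w2"]
    # Classify from the class token only.
    return softmax(layer_norm(x[:, 0]) @ params["head"])

rng = np.random.default_rng(0)
params = init_params(rng, n_feats=40, n_frames=98, dim=64, depth=2, n_classes=12)
probs = forward(params, rng.normal(size=(3, 98, 40)), num_heads=2)
print(probs.shape)  # (3, 12): one distribution over 12 commands per clip
```

Because the only operation that mixes information across time is attention, swapping this model in for a convolutional or recurrent keyword spotter requires nothing more than matching the input spectrogram shape and the number of output classes.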

Code Implementations: 10 repos

Data from Papers with Code (CC-BY-SA-4.0)

