AS, AI, CL, LG, SD
May 21, 2023

Multi-Head State Space Model for Speech Recognition

arXiv:2305.12498v2
20 citations
Originality: Highly original
AI Analysis

This work targets speech recognition accuracy for applications such as transcription. It proposes a state space alternative to multi-head attention that outperforms a transformer transducer baseline on LibriSpeech.

The paper tackles speech recognition by proposing a multi-head state space model (MH-SSM) as a drop-in replacement for multi-head attention in transformers, achieving state-of-the-art performance on LibriSpeech with word error rates of 1.76%/4.37% on development and 1.91%/4.36% on test sets without external language models.
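For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the example sentences are hypothetical, and it does not reproduce the text normalization or scoring tools used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

On this scale, a reported figure of 1.91% corresponds to roughly two word errors per hundred reference words.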

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets without using an external language model.
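To make the idea of parallel state space heads concrete, here is a minimal NumPy sketch. It is not the paper's architecture: the abstract does not specify the SSM parameterization or the gating mechanism, so the diagonal recurrence, the sigmoid gate, and every name below (`ssm_head`, `multi_head_ssm`, `w_gate`, `w_out`) are assumptions made for illustration.

```python
import numpy as np


def ssm_head(u, a, b, c):
    """Run a diagonal linear SSM over one head.

    u: (T, d_head) input sequence; a, b, c: (d_head, n_state) parameters.
    Recurrence per channel: x_k = a * x_{k-1} + b * u_k,  y_k = sum(c * x_k).
    """
    T, d_head = u.shape
    x = np.zeros_like(a)
    y = np.empty((T, d_head))
    for k in range(T):
        x = a * x + b * u[k][:, None]  # state update, broadcast over state dim
        y[k] = (c * x).sum(axis=1)     # read-out of the hidden state
    return y


def multi_head_ssm(u, params, w_gate, w_out):
    """u: (T, d_model); params: one (a, b, c) tuple per head."""
    heads = np.split(u, len(params), axis=1)  # split channels across heads
    y = np.concatenate(
        [ssm_head(h, *p) for h, p in zip(heads, params)], axis=1
    )
    gate = 1.0 / (1.0 + np.exp(-(u @ w_gate)))  # sigmoid gate (assumed form)
    return (gate * y) @ w_out                    # output projection


rng = np.random.default_rng(0)
T, d_model, n_heads, n_state = 6, 8, 2, 4
d_head = d_model // n_heads
params = [
    (
        rng.uniform(0.5, 0.99, (d_head, n_state)),  # decay in (0, 1) keeps the recurrence stable
        rng.normal(size=(d_head, n_state)),
        rng.normal(size=(d_head, n_state)),
    )
    for _ in range(n_heads)
]
w_gate = rng.normal(size=(d_model, d_model)) * 0.1
w_out = rng.normal(size=(d_model, d_model)) * 0.1
u = rng.normal(size=(T, d_model))
out = multi_head_ssm(u, params, w_gate, w_out)
print(out.shape)  # (6, 8)
```

In this toy setup, a decay value near 1 lets a state retain information over many frames while a smaller value forgets quickly. That is one simple way heads could specialize in longer or shorter temporal context, though the paper may achieve this differently. Like self-attention, the layer maps a `(T, d_model)` sequence to a `(T, d_model)` sequence, which is what makes a drop-in replacement possible.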

