AS AI CL LG SD
May 21, 2023

Multi-Head State Space Model for Speech Recognition

Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. F. Gales

arXiv:2305.12498v2

This work targets speech recognition accuracy, proposing a state space alternative to multi-head attention that outperforms the transformer transducer on the LibriSpeech corpus.

The paper tackles speech recognition by proposing a multi-head state space model (MH-SSM) as a drop-in replacement for multi-head attention in transformers, achieving state-of-the-art performance on LibriSpeech with word error rates of 1.76%/4.37% on development and 1.91%/4.36% on test sets without external language models.
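For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example, and it is not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A reported WER of 1.91% corresponds to a return value of about 0.0191 when the errors and reference words are accumulated over a whole test set.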

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets without using an external language model.
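The abstract does not spell out the layer equations, so the following is only a rough sketch of the general idea: several heads, each running its own linear state space recurrence over a slice of the feature channels, with a gate applied to the concatenated output. Everything here is an assumption made for illustration (the diagonal recurrence, the sigmoid gate, and the names `ssm_head`, `multi_head_ssm`, and `w_gate`); it should not be read as the MH-SSM parameterization or gating used in the paper.

```python
import numpy as np


def ssm_head(u, a, b, c):
    """Run a diagonal linear state space recurrence for one head.

    u: (T, d_head) input sequence.
    a, b, c: (d_head, n_state) per-channel parameters (hypothetical layout).
    Recurrence per channel: x_t = a * x_{t-1} + b * u_t,  y_t = sum(c * x_t).
    """
    num_steps, d_head = u.shape
    x = np.zeros_like(a, dtype=float)
    y = np.empty((num_steps, d_head))
    for t in range(num_steps):
        x = a * x + b * u[t][:, None]
        y[t] = (c * x).sum(axis=-1)
    return y


def multi_head_ssm(u, head_params, w_gate):
    """Split channels across heads, run one SSM per head, then gate the result.

    u: (T, d_model); head_params: list of (a, b, c) tuples, one per head;
    w_gate: (d_model, d_model) gate projection (an assumed, simple sigmoid gate).
    """
    heads = np.split(u, len(head_params), axis=-1)
    outputs = [ssm_head(h, a, b, c) for h, (a, b, c) in zip(heads, head_params)]
    y = np.concatenate(outputs, axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(u @ w_gate)))
    return gate * y
```

Because each head carries its own decay parameters `a`, values near 1 retain long-range context while values near 0 respond mostly to recent frames, which is one simple way heads could specialize in global versus local dynamics. The output has the same shape as the input, which is what allows such a layer to stand in where a multi-head attention block would otherwise sit.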
