CL · SD · AS · Sep 15, 2023

Augmenting conformers with structured state-space sequence models for online speech recognition

arXiv:2309.08551v2 · 9 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient online ASR for real-time applications, presenting an incremental improvement over existing methods.

The paper tackles online speech recognition by augmenting neural encoders with structured state-space sequence models (S4), which give efficient access to long left context. Its best model reaches word error rates of 4.01%/8.53% on the Librispeech test sets, outperforming Conformers with extensively tuned convolution.

Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. We performed systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We found that the most effective design is to stack a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
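To make the "small S4 stacked with a local convolution" idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a diagonal, real-valued linear recurrence as a stand-in for the S4 layer, a causal depthwise convolution for the local part, and a simple sequential stacking order. All function names, shapes, and parameter names (`a`, `b`, `c`, `w`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def diagonal_ssm(x, a, b, c):
    """Causal diagonal linear recurrence with real-valued weights (illustrative).

    x: (T, D) input frames; a, b, c: (D, N) per-channel state parameters.
    State update:  h_t = a * h_{t-1} + b * x_t
    Output:        y_t = sum over the N state dimensions of c * h_t
    """
    T, D = x.shape
    h = np.zeros(a.shape)
    y = np.empty((T, D))
    for t in range(T):
        h = a * h + b * x[t][:, None]   # each channel drives its own N states
        y[t] = (c * h).sum(axis=1)
    return y

def causal_depthwise_conv(x, w):
    """Causal depthwise convolution; w has shape (K, D), w[k] weights frame t-k."""
    K, D = w.shape
    T = x.shape[0]
    # Left-pad with K-1 zero frames so no output depends on future input.
    padded = np.concatenate([np.zeros((K - 1, D)), x], axis=0)
    y = np.zeros((T, D))
    for k in range(K):
        start = K - 1 - k               # padded[start + t] == x[t - k]
        y += w[k] * padded[start:start + T]
    return y

def s4_conv_block(x, a, b, c, w):
    """Assumed stacking: long-context recurrence first, local convolution second."""
    return causal_depthwise_conv(diagonal_ssm(x, a, b, c), w)
```

Because the recurrence carries its state forward frame by frame, every output can depend on the entire left context at a fixed per-frame cost, while the convolution only sees the last K frames. Neither component looks at future frames, which is the constraint the abstract places on online recognition.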

