SD · AI · AS · Sep 11, 2024

Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

Cambridge
arXiv:2409.07165v1 · 1 citation · h-index: 13
Originality: Incremental advance
AI Analysis

This work reduces the computational cost of self-attention-based ASR encoders, which eases deployment on constrained devices. The advance is incremental because it builds on the existing SummaryMixing method.

The authors tackled the quadratic time complexity of self-attention in speech recognition by extending SummaryMixing to a Conformer Transducer for streaming and offline modes, achieving better accuracy with less compute and memory.

Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
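
To make the linear-time idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, the learned local and summary transformations are replaced by the identity, and the streaming variant assumes the summary is simply a running mean over the frames seen so far, which the abstract does not spell out. The point is only that one pass over T frames is enough to give every frame a global summary, whereas self-attention compares every pair of frames.

```python
def offline_summary(frames):
    """Mean over all T frames, computed in a single O(T) pass."""
    dim = len(frames[0])
    total = [0.0] * dim
    for frame in frames:
        for i, value in enumerate(frame):
            total[i] += value
    return [v / len(frames) for v in total]


def streaming_summaries(frames):
    """Running mean per frame (assumed streaming variant).

    Frame t only uses frames 1..t, so no future context is needed,
    and the total work is still O(T).
    """
    dim = len(frames[0])
    total = [0.0] * dim
    summaries = []
    for t, frame in enumerate(frames, start=1):
        for i, value in enumerate(frame):
            total[i] += value
        summaries.append([v / t for v in total])
    return summaries


def mix(frames, summaries):
    """Concatenate each frame's local features with its summary vector."""
    return [frame + summary for frame, summary in zip(frames, summaries)]


# Toy utterance: 3 frames with 2 features each (made-up values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Offline mode: every frame shares the same utterance-level summary.
offline = mix(frames, [offline_summary(frames)] * len(frames))

# Streaming mode: each frame gets the summary of the past and present only.
streaming = mix(frames, streaming_summaries(frames))
```

In this sketch the last streaming frame sees the same summary as the offline mode, since by then the running mean covers the whole utterance. In a real encoder the concatenated vector would then pass through a learned combiner layer.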

Code Implementations: 1 repo
