SD AI AS Sep 11, 2024

Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya

Cambridge

arXiv:2409.07165v14.91 citations h-index: 19 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the computational inefficiency of ASR systems, enabling deployment on constrained devices, though it is incremental as it builds on existing SummaryMixing.

The authors tackled the quadratic time complexity of self-attention in speech recognition by extending SummaryMixing to a Conformer Transducer for streaming and offline modes, achieving better accuracy with less compute and memory.

Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both streaming and offline modes. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.
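To illustrate why a summary-based encoder can run in linear time, and why the offline form does not carry over directly to streaming, here is a minimal sketch of the summary step only. It assumes the summary is a plain mean over per-frame feature vectors and leaves out the learned transformations entirely. The function names `mean_summary` and `running_summary` are hypothetical, and the running mean over past frames is an assumption made for illustration, not necessarily the scheme the paper uses.

```python
def mean_summary(frames):
    """Offline sketch: one summary vector shared by every frame.

    A single pass over the T frames gives O(T) cost, but the summary
    depends on the whole utterance, including future frames.
    """
    num_frames, dim = len(frames), len(frames[0])
    total = [sum(frame[j] for frame in frames) for j in range(dim)]
    mean = [value / num_frames for value in total]
    return [mean[:] for _ in range(num_frames)]


def running_summary(frames):
    """Streaming sketch (assumed variant): summary at frame t uses frames 0..t.

    A running sum keeps the cost at O(T) and never looks at future frames.
    """
    total = [0.0] * len(frames[0])
    summaries = []
    for count, frame in enumerate(frames, start=1):
        total = [acc + value for acc, value in zip(total, frame)]
        summaries.append([value / count for value in total])
    return summaries


if __name__ == "__main__":
    utterance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 frames, 2 features
    print(mean_summary(utterance))
    print(running_summary(utterance))
```

In this sketch both variants touch each frame once, in contrast to self-attention, which compares every pair of frames. Only the running variant yields the same output for a frame regardless of what audio arrives afterwards.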
