LG, CL, SD, AS | Sep 4, 2024

An Analysis of Linear Complexity Attention Substitutes with BEST-RQ

arXiv:2409.02596v1 | 3 citations | h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the high computational cost of self-supervised learning for speech processing, offering incremental improvements in efficiency.

The study addresses the compute and memory cost of self-supervised learning for speech by replacing quadratic-complexity multi-head self-attention with linear-complexity alternatives such as HyperMixing and Mamba. These alternatives remain competitive in performance while reducing VRAM consumption by 20-60% and increasing speed by 7-65% on input sequences of 20 to 80 seconds.

Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is expensive in both computation and memory. This is in part due to the quadratic complexity of multi-head self-attention (MHSA). Alternatives to MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods on speed, VRAM consumption, and performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed by 7% to 65% for input sequences ranging from 20 to 80 seconds.
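
To make the quadratic-versus-linear distinction concrete, the sketch below contrasts plain single-head self-attention with a simplified mean-summary mixing layer, loosely in the spirit of SummaryMixing. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `tanh` transforms, the random weights, and the feature size `d = 8` are all hypothetical. It only counts how many intermediate values each approach materialises for a sequence of `T` frames.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention; materialises a (T, T) score matrix."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])        # shape (T, T): quadratic in T
    scores -= scores.max(axis=-1, keepdims=True)   # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, scores.size

def summary_mixing(x, wf, ws, wc):
    """Simplified mean-summary mixing (assumed form); no (T, T) matrix."""
    local = np.tanh(x @ wf)                        # per-frame transform, (T, d)
    summary = np.tanh(x @ ws).mean(axis=0)         # one global vector, (d,)
    tiled = np.broadcast_to(summary, local.shape)  # share the summary with every frame
    combined = np.concatenate([local, tiled], axis=-1)
    return combined @ wc, local.size + summary.size

# Hypothetical random weights, for illustration only.
rng = np.random.default_rng(0)
d = 8
wq, wk, wv, wf, ws = (rng.standard_normal((d, d)) for _ in range(5))
wc = rng.standard_normal((2 * d, d))

def intermediate_sizes(T):
    """Return (attention score cells, mixing intermediate cells) for T frames."""
    x = rng.standard_normal((T, d))
    _, attn_cells = self_attention(x, wq, wk, wv)
    _, mix_cells = summary_mixing(x, wf, ws, wc)
    return attn_cells, mix_cells
```

Doubling `T` from 100 to 200 takes the attention score count from 10,000 to 40,000, while the mixing intermediates grow from 808 to 1,608. This scaling difference is what lets linear alternatives save memory and time on longer inputs; the actual savings reported above depend on the full model and hardware, which this toy count does not capture.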

