CL · AI · AS · Jul 18, 2024

Linear-Complexity Self-Supervised Learning for Speech Processing

Cambridge
arXiv:2407.13377v1 · 1 citation · h-index: 13 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the scalability issue of SSL models for speech processing, making them more accessible by lowering hardware requirements, though it is incremental as it adapts an existing linear-complexity method to SSL.

The paper tackles the high computational cost of self-supervised learning (SSL) for speech processing by introducing a linear-complexity context encoder, reducing pre-training time by 18% and peak VRAM by 23% while maintaining or improving performance on downstream tasks.

Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not yet been explored for SSL. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of the wav2vec 2.0 model by 18% and 23%, respectively, so that pre-training a 155M wav2vec 2.0 model finishes within one week on 4 Tesla A100 GPUs. Code is available at https://github.com/SamsungLabs/SummaryMixing.
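To make the complexity argument concrete, here is a minimal sketch of the general idea behind a summary-based context encoder: each frame gets a local transformation, all frames contribute to a single mean summary vector, and every frame is then combined with that shared summary. Because the summary is computed once, the cost grows linearly with the number of frames T, whereas MHSA scores all T × T frame pairs. This is an illustration under simplifying assumptions, not the code from the SamsungLabs repository: the function names (`summary_mixing`, `linear`, `rand_matrix`), the single ReLU-linear layers, and the random weights are all hypothetical stand-ins for the learned networks used in practice.

```python
import random


def linear(x, w):
    """Apply a weight matrix w (a list of rows) to a vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, a) for a in v]


def summary_mixing(frames, w_local, w_summary, w_combine):
    """Mix T frames in O(T) time using one global mean summary."""
    T = len(frames)
    # Per-frame local features: one pass over the sequence.
    local = [relu(linear(x, w_local)) for x in frames]
    # Per-frame contributions to the summary: one more pass.
    contrib = [relu(linear(x, w_summary)) for x in frames]
    dim = len(contrib[0])
    # A single mean over time, shared by every frame (no pairwise scores).
    summary = [sum(c[j] for c in contrib) / T for j in range(dim)]
    # Concatenate each local vector with the shared summary, then combine.
    return [relu(linear(loc + summary, w_combine)) for loc in local]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed for illustration): model dim 4, hidden dim 3, 5 frames.
rng = random.Random(0)
d, h, T = 4, 3, 5
w_local = rand_matrix(h, d, rng)
w_summary = rand_matrix(h, d, rng)
w_combine = rand_matrix(d, 2 * h, rng)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(T)]

out = summary_mixing(frames, w_local, w_summary, w_combine)
print(len(out), len(out[0]))  # one d-dimensional output vector per input frame
```

Doubling T in this sketch doubles the work in each of the three passes, while a pairwise attention score matrix would grow fourfold. The sketch says nothing about the measured 18% and 23% savings, which depend on the full wav2vec 2.0 pipeline.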

Code Implementations: 1 repo
