AS · CL · LG · Jul 3, 2024

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

arXiv:2407.03495v1 · 16 citations · h-index: 15
AI Analysis

This work addresses improving ASR performance and efficiency for multilingual applications, though it appears incremental as it builds on existing discrete representation methods.

The paper studies how to build automatic speech recognition (ASR) systems on discrete speech representations. The resulting pipeline outperforms Encodec at a similar bit-rate and surpasses state-of-the-art self-supervised models on the 143-language ML-SUPERB benchmark, while being smaller and pretrained on less data.

Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis of building ASR systems with discrete codes. We investigate different methods for codec training, such as quantization schemes and time-domain vs. spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at a similar bit-rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143-language ML-SUPERB benchmark despite being smaller in size and pretrained on significantly less data.
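
Since the abstract compares codecs "at similar bit-rate", it may help to see how the bit-rate of a discrete codec is usually computed: frames per second, times the number of codebooks per frame, times the bits needed to index one codebook entry. The following is a minimal sketch of that arithmetic only. The function name and the example configuration (75 frames per second, 8 codebooks, 1024 entries each) are illustrative assumptions and are not taken from the paper.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second needed to transmit the discrete codes of a codec.

    Hypothetical helper for illustration; not part of any codec library.
    """
    # Each code indexes one entry of a codebook, so it costs log2(size) bits.
    bits_per_code = math.log2(codebook_size)
    # Every frame emits one code per codebook.
    return frame_rate_hz * num_codebooks * bits_per_code


# Illustrative configuration (an assumption, not a figure from the paper):
# 75 frames/s, 8 codebooks, 1024 entries per codebook.
example_bps = codec_bitrate_bps(frame_rate_hz=75, num_codebooks=8, codebook_size=1024)
print(f"{example_bps / 1000:.1f} kbps")
```

Under this formula, two codecs with different frame rates or codebook counts can still be matched in bit-rate, which is what makes a comparison at similar bit-rate meaningful.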

