SD | CL | AS | May 30, 2025

Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC

NVIDIA
arXiv:2505.24200v2 | 1 citation | h-index: 19 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work targets performance gaps in multilingual language identification (LID) and automatic speech recognition (ASR), and represents an incremental advance.

The paper tackles the problem of limited resources when fine-tuning multilingual speech models. On ML-SUPERB 2.0 it achieves a 14% relative improvement in LID accuracy and a 30% relative reduction in ASR character error rate (CER) over the baseline.

Multilingual speech processing with self-supervised or supervised pre-trained Speech Foundation Models (SFMs) has achieved strong performance on tasks like Language Identification (LID) and Automatic Speech Recognition (ASR). However, these models struggle with limited resources during fine-tuning. This paper enhances multilingual LID and ASR on ML-SUPERB 2.0 by exploring multiple strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. Furthermore, we employ data augmentation to mitigate performance gaps in few-shot settings and introduce an LID Connectionist Temporal Classification (CTC) loss for regularization. Our approach achieves a 14% relative improvement in LID accuracy and a 30% relative reduction in ASR character error rate (CER) over the baseline on ML-SUPERB 2.0, securing second place in the Interspeech 2025 ML-SUPERB 2.0 Challenge.
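The headline numbers are relative changes in accuracy and CER, which are easy to confuse with absolute differences. The following is a minimal sketch of how such figures are conventionally computed, assuming CER is character-level edit distance divided by reference length. The function names (`edit_distance`, `cer`, `relative_change`) are hypothetical helpers for illustration, not the challenge's scoring code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_change(baseline: float, new: float) -> float:
    """Signed relative change of `new` with respect to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline
```

As an illustration with made-up values (not the paper's reported scores), a baseline CER of 0.20 that drops to 0.14 gives `relative_change(0.20, 0.14)` of about -0.30, a 30% relative reduction, while an LID accuracy that rises from 0.70 to 0.798 gives about +0.14, a 14% relative improvement.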

