CL · AI · SD · AS · Jun 30, 2024

Towards Robust Speech Representation Learning for Thousands of Languages

arXiv:2407.00837v2 · 56 citations
AI Analysis

This work addresses the challenge of making speech technologies accessible for thousands of languages. The gain in language coverage is significant, but the approach is an incremental extension of existing self-supervised learning methods.

The authors extend self-supervised speech representation learning to thousands of languages with XEUS, a model trained on over 1 million hours of data across 4057 languages. This is a 4-fold increase in the language coverage of SSL models, and XEUS sets a new state of the art on the ML-SUPERB benchmark, outperforming MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively.

Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are available at https://www.wavlab.org/activities/2024/xeus/.
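
The abstract does not spell out how the dereverberation objective is implemented. As a rough illustration of the general idea only, the sketch below builds one training example in which the model input is artificially reverberated while the prediction targets would still be derived from the clean audio. Everything here is an assumption made for illustration: the function names `synthetic_rir` and `make_training_pair`, the toy exponentially decaying impulse response, the 320-sample frame size, and the simple per-frame random masking are not taken from the paper or its code release.

```python
import numpy as np


def synthetic_rir(sample_rate=16000, rt60=0.4, seed=0):
    """Toy room impulse response (hypothetical helper, not from the paper).

    White noise shaped by an exponential envelope that has decayed by
    60 dB after `rt60` seconds, plus a unit-gain direct path at t = 0.
    """
    rng = np.random.default_rng(seed)
    n = int(round(sample_rate * rt60))
    t = np.arange(n) / sample_rate
    envelope = 10.0 ** (-3.0 * t / rt60)  # 10^-3 in amplitude is -60 dB
    rir = 0.1 * rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir


def make_training_pair(clean, rir, frame_len=320, mask_prob=0.3, seed=0):
    """Return (model_input, target_source, frame_mask) for one utterance.

    The input is the reverberated waveform; the target source is the clean
    waveform, so predicting targets at masked frames implicitly requires
    undoing the reverberation. Masking is simplified to independent
    per-frame sampling.
    """
    rng = np.random.default_rng(seed)
    reverberant = np.convolve(clean, rir)[: len(clean)]  # keep original length
    num_frames = len(clean) // frame_len
    frame_mask = rng.random(num_frames) < mask_prob
    return reverberant, clean, frame_mask


# Usage: one second of placeholder audio at 16 kHz.
clean = np.random.default_rng(1).standard_normal(16000)
model_input, target_source, frame_mask = make_training_pair(clean, synthetic_rir())
print(model_input.shape, int(frame_mask.sum()))
```

In a real pipeline the impulse responses would more likely come from recorded or simulated rooms, and the targets would be discrete units computed from the clean signal rather than the raw waveform.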
