CL · AI · Oct 4, 2025

Towards Unsupervised Speech Recognition at the Syllable-Level

arXiv:2510.03639v1 · 1 citation · h-index: 44
Originality: Highly original
AI Analysis

This paper addresses the problem of extending speech recognition to low-resource languages and non-parallel data. It represents a strong, specific gain rather than a broad paradigm shift.

The paper tackles unsupervised speech recognition by proposing a syllable-level framework that avoids grapheme-to-phoneme converters and GAN instability, achieving up to a 40% relative reduction in character error rate on LibriSpeech and effective generalization to Mandarin.

Training speech recognizers with unpaired speech and text, known as unsupervised speech recognition (UASR), is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
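
For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate and a relative reduction between two systems are commonly computed. It is not the paper's evaluation code; the function names and the example numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline`."""
    return (baseline - new) / baseline


# Illustrative values only (not results from the paper):
# a baseline CER of 0.20 falling to 0.12 is a 40% relative reduction.
example = relative_reduction(0.20, 0.12)
```

A relative reduction is measured against the baseline error, so a 40% relative reduction does not mean the CER drops by 40 percentage points.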
