CL AI

Oct 4, 2025

Towards Unsupervised Speech Recognition at the Syllable-Level

Liming Wang, Junrui Ni, Kai-Wei Chang, Saurabhchand Bhati, David Harwath, Mark Hasegawa-Johnson, James R. Glass

arXiv:2510.03639v14.91 citations

h-index: 44

Originality Highly original

AI Analysis

This work addresses the problem of extending speech recognition to low-resource languages and non-parallel data. It represents a strong, specific gain rather than a broad paradigm shift.

The paper tackles unsupervised speech recognition by proposing a syllable-level framework that avoids grapheme-to-phoneme converters and GAN instability, achieving up to a 40% relative reduction in character error rate on LibriSpeech and effective generalization to Mandarin.

Training speech recognizers with unpaired speech and text, known as unsupervised speech recognition (UASR), is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
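For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate and a relative reduction between two systems are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, the CER here is plain Levenshtein distance divided by reference length, and the numbers in the usage lines are invented for illustration and do not come from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (0.4 means 40%)."""
    return (baseline - new) / baseline


# Illustrative values only (assumed, not reported in the paper):
# a baseline CER of 20% and a new CER of 12% give a 40% relative reduction.
example = relative_reduction(0.20, 0.12)
```

Note that a relative reduction is measured against the baseline error, so a 40% relative reduction is not the same as lowering the CER by 40 percentage points.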

