CL, SD, AS | Jun 12, 2024

Towards Unsupervised Speech Recognition Without Pronunciation Models

arXiv:2406.08380v2 | 3 citations
Originality Highly original
AI Analysis

This paper addresses the problem of developing ASR for languages that lack transcribed data, though it is incremental in that it builds on prior unsupervised models.

The paper tackles unsupervised speech recognition without paired speech-text data or pronunciation models. It proposes a word-level approach based on joint speech-to-speech and text-to-text masked token-infilling, and reports a word error rate of 20% to 23% on a curated English corpus without parallel transcripts or a lexicon.

Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by removing the reliance on a phoneme lexicon. We explore a new research direction, word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate of between 20% and 23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This model surpasses the performance of previous unsupervised ASR models in the lexicon-free setting.
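
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the abstract does not describe the authors' scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no text normalization (hypothetical choice for this sketch).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 20% corresponds to roughly one word-level error for every five reference words.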

Code Implementations: 1 repo
