CL, Feb 9, 2023

Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions

Stanford
arXiv:2302.04975v1, 288 citations, h-index: 109
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building ASR systems for low-resource languages with limited transcriptions. It is an incremental advance, building on prior research on fine-tuning pre-trained transformer models.

The study investigated whether 10 minutes of transcribed speech is enough to develop an automatic speech recognition system when supplemented with text data. For Gronings and Frisian, lexica and language models derived from subcorpora of around 80k tokens reduced the word error rate (WER) to 39% on average, which suggests some promise for approaching 30% WER with minimal speech data. Using a lexicon alone did not appreciably improve performance for any language.

Recent research using pre-trained transformer models suggests that just 10 minutes of transcribed speech may be enough to fine-tune such a model for automatic speech recognition (ASR) -- at least if we can also leverage vast amounts of text data (803 million tokens). But is that much text data necessary? We study the use of different amounts of text data, both for creating a lexicon that constrains ASR decoding to possible words (e.g. *dogz vs. dogs), and for training larger language models that bias the system toward probable word sequences (e.g. too dogs vs. two dogs). We perform experiments using 10 minutes of transcribed speech from English (for replicating prior work) and two additional pairs of languages differing in the availability of supplemental text data: Gronings and Frisian (~7.5M token corpora available), and Besemah and Nasal (only small lexica available). For all languages, we found that using only a lexicon did not appreciably improve ASR performance. For Gronings and Frisian, we found that lexica and language models derived from 'novel-length' 80k token subcorpora reduced the word error rate (WER) to 39% on average. Our findings suggest that where a text corpus in the upper tens of thousands of tokens or more is available, fine-tuning a transformer model with just tens of minutes of transcribed speech holds some promise towards obtaining human-correctable transcriptions near the 30% WER rule-of-thumb.
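For readers unfamiliar with the metric, the word error rate figures quoted above can be illustrated with a minimal sketch. It assumes the standard definition (word-level edit distance divided by the number of reference words) and whitespace tokenization; the function name `word_error_rate` and the example sentences are illustrative and are not taken from the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of two reference words.
print(word_error_rate("two dogs", "too dogs"))
# One substitution plus one deleted word out of five reference words.
print(word_error_rate("the two dogs ran home", "the too dogs ran"))
```

Under this definition, a WER of 39% means roughly two in every five reference words need a correction, and WER can exceed 100% when the hypothesis contains many inserted words.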

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
