CL · Feb 9, 2023

Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions

Nay San, Martijn Bartelds, Blaine Billings, Ella de Falco, Hendi Feriza, Johan Safri, Wawan Sahrozi, Ben Foley, Bradley McDonnell, Dan Jurafsky

Stanford

arXiv:2302.04975v1

Originality: Incremental advance

AI Analysis

This work addresses the challenge of building ASR systems for low-resource languages with limited transcriptions, though it is incremental as it builds on prior research on fine-tuning transformers.

The study investigated whether 10 minutes of transcribed speech can suffice to develop an automatic speech recognition system when supplementary text data is also available. For Gronings and Frisian, lexica and language models derived from subcorpora of around 80k tokens reduced the word error rate (WER) to 39% on average, which suggests some promise for approaching 30% WER with minimal speech data.

Recent research using pre-trained transformer models suggests that just 10 minutes of transcribed speech may be enough to fine-tune such a model for automatic speech recognition (ASR) -- at least if we can also leverage vast amounts of text data (803 million tokens). But is that much text data necessary? We study the use of different amounts of text data, both for creating a lexicon that constrains ASR decoding to possible words (e.g. *dogz vs. dogs), and for training larger language models that bias the system toward probable word sequences (e.g. too dogs vs. two dogs). We perform experiments using 10 minutes of transcribed speech from English (for replicating prior work) and two additional pairs of languages differing in the availability of supplemental text data: Gronings and Frisian (~7.5M token corpora available), and Besemah and Nasal (only small lexica available). For all languages, we found that using only a lexicon did not appreciably improve ASR performance. For Gronings and Frisian, we found that lexica and language models derived from 'novel-length' 80k token subcorpora reduced the word error rate (WER) to 39% on average. Our findings suggest that where a text corpus in the upper tens of thousands of tokens or more is available, fine-tuning a transformer model with just tens of minutes of transcribed speech holds some promise towards obtaining human-correctable transcriptions near the 30% WER rule-of-thumb.
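The abstract reports results as word error rate (WER). As a minimal sketch of how that metric is commonly computed (word-level edit distance divided by the number of reference words), the following may help; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# The abstract's example of an improbable word sequence: one substitution
# out of two reference words.
print(word_error_rate("two dogs", "too dogs"))  # 0.5
```

Under this definition, a WER of 39% means roughly 39 word-level edits would be needed per 100 reference words to correct the system output.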
