Linguistically Informed Tokenization Improves ASR for Underresourced Languages
This work addresses the challenge of making ASR usable for underresourced languages, a capability that matters to linguists doing language documentation. The contribution is incremental: it applies an existing method (fine-tuning wav2vec2) with a novel tokenization strategy.
The researchers tackle automatic speech recognition (ASR) for underresourced languages by fine-tuning a wav2vec2 model on Yan-nhangu, an Indigenous Australian language. They find that a linguistically informed phonemic tokenization system substantially improves word error rate (WER) and character error rate (CER) over a baseline orthographic tokenization scheme.
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
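For readers unfamiliar with the two metrics, the sketch below shows the standard way WER and CER are usually computed: the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length, over words for WER and over characters for CER. It is a minimal illustration, not the authors' evaluation code. The function names are hypothetical, and the example strings in the usage note are English placeholders rather than Yan-nhangu data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, insertions, and deletions each cost 1.
    Works on any indexable sequence (a string or a list of words).
    """
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

With a reference of "the cat sat" and a hypothesis of "the cat sit", `wer` counts one wrong word out of three, while `cer` counts one wrong character out of eleven (spaces included). A single-character error can therefore move WER far more than CER, which is one reason both metrics are reported side by side.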