CL | SD | AS | Sep 13, 2024

Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages

arXiv:2409.08872v2 | 1 citation | h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses ASR for extremely low-resource and endangered languages; its contribution is incremental and domain-specific.

This study tackles low-resource automatic speech recognition for endangered languages. It proposes a novel data-selection scheme that draws on a multilingual corpus to augment the limited target-language data, and it reports substantial improvements in ASR performance for Amis and Seediq.

This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.
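
The selection step described in the abstract (embed utterances, score them with a one-class model fitted on target-language data, then rank and keep the top-scoring ones) can be illustrated with a minimal sketch. The abstract does not name the language classifier or the one-class classifier, so everything below is an assumption made for illustration: the embeddings are toy 2-D vectors, and a simple centroid-plus-cosine-similarity scorer stands in for the one-class classifier. The names `rank_candidates`, `target`, and `pool` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Mean of the target-language embeddings (the stand-in one-class 'model')."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as the decision score (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    target_embeddings: List[Vector],
    candidates: Dict[str, Vector],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score each candidate utterance against the target data and keep the top_k."""
    center = centroid(target_embeddings)
    scored = [(utt_id, cosine(vec, center)) for utt_id, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:top_k]


# Toy data: hypothetical embeddings, not taken from the paper.
target = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
pool = {
    "utt_a": [0.95, 0.1],
    "utt_b": [0.0, 1.0],
    "utt_c": [-1.0, 0.0],
    "utt_d": [0.7, 0.5],
}
selected = rank_candidates(target, pool, top_k=2)
for utt_id, score in selected:
    print(utt_id, round(score, 3))
```

In a real pipeline the centroid scorer would presumably be replaced by a trained one-class classifier, and the embeddings would come from a language-identification model. The rank-then-truncate logic is the only part of this sketch that mirrors the abstract directly.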
