AS, CL, SD | Sep 22, 2023

Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models

arXiv:2309.12763v24 citations, h-index: 39
Originality: Incremental advance
AI Analysis

This work addresses data scarcity in speech processing for low-resource languages, offering an incremental improvement over existing augmentation strategies.

The paper tackled the challenge of training self-supervised speech models for low-resource languages by comparing audio augmentation techniques, finding that combined synthetic augmentations (noise and pitch) outperformed other methods like accent or language transfer for phoneme recognition.

Self-supervised representation learning (SSRL) has demonstrated better performance than supervised models on tasks including phoneme recognition. Training SSRL models is challenging for low-resource languages, where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition and evaluate them on phoneme recognition. Our comparisons found that a combined synthetic augmentation strategy (noise and pitch) outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data needed to achieve performance equivalent to a model pre-trained with target-domain speech. Our findings suggest that for resource-constrained languages, combined augmentations can be a more viable option than the other augmentations.
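
The two synthetic augmentations named in the abstract, noise addition and pitch variation, can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the function names (`add_noise`, `shift_pitch`, `augment`), the use of white Gaussian noise, the SNR parameter, and the resampling-based pitch shift are all assumptions chosen for illustration. The resampling approach also changes the clip duration, which a production pitch shifter would normally avoid.

```python
import numpy as np


def add_noise(wave, snr_db, rng):
    """Mix white Gaussian noise into `wave` at the requested SNR in dB.

    Assumption: white noise; the source does not say which noise type is used.
    """
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def shift_pitch(wave, semitones):
    """Crude pitch shift by linear-interpolation resampling.

    Assumption: a stand-in for a real pitch shifter. Playing the result at
    the original sample rate raises the pitch by `semitones`, but it also
    shortens (or lengthens) the clip by the same factor.
    """
    factor = 2 ** (semitones / 12)
    n_out = int(round(len(wave) / factor))
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(wave)), wave)


def augment(wave, snr_db, semitones, rng):
    """Combined augmentation: pitch shift first, then add noise."""
    return add_noise(shift_pitch(wave, semitones), snr_db, rng)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000           # one second at a hypothetical 16 kHz
    clean = np.sin(2 * np.pi * 220 * t)    # synthetic tone standing in for speech
    augmented = augment(clean, snr_db=10.0, semitones=2, rng=rng)
    print(len(clean), len(augmented))
```

Applying the pitch shift before the noise keeps the added noise spectrally white in the output; the reverse order would shift the noise as well. Either order is a design choice, and the source does not specify one.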

