AS CL SD May 19, 2023

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

arXiv:2305.11569v1 6.68 citations

Originality: Incremental advance

AI Analysis

This work addresses low-resource speech recognition for multilingual applications, offering a method that is incremental but provides practical gains in data efficiency.

The paper tackles low-resource automatic speech recognition by combining multilingual training with self-supervised learning: International Phonetic Alphabet (IPA) pseudo labels guide HuBERT pretraining. The result is a consistent improvement over standard HuBERT on all target languages and a reduction of up to 75% in supervised training data.

We improve low-resource ASR by integrating the ideas of multilingual training and self-supervised learning. Concretely, we leverage an International Phonetic Alphabet (IPA) multilingual model to create frame-level pseudo labels for unlabeled speech, and use these pseudo labels to guide hidden-unit BERT (HuBERT) based speech pretraining in a phonetically informed manner. Experiments on the Multilingual Speech (MLS) Corpus show that the proposed approach consistently outperforms the standard HuBERT on all the target languages. Moreover, on 3 of the 4 languages, the approach performs better than the standard HuBERT while saving up to 1.5k hours (75%) of supervised training data. Our approach outperforms most state-of-the-art methods with much less pretraining data in terms of hours and language diversity. Compared to XLSR-53 and a retraining-based multilingual method, our approach performs better in both full and limited finetuning data scenarios.
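The pseudo-labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how labels are derived from the IPA model, so taking the argmax over per-frame phone posteriors is an assumption, as are the function names, the random stand-in data, and the choice to compute the loss on masked frames only (the common HuBERT-style setup).

```python
import numpy as np

def frame_pseudo_labels(phone_posteriors):
    """Assumed labeling rule: most likely IPA phone index per frame."""
    return phone_posteriors.argmax(axis=-1)

def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only (HuBERT-style objective)."""
    # Numerically stable log-softmax over the phone-class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability the model assigns to each frame's pseudo label.
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked[mask].mean())

# Hypothetical stand-ins: random posteriors play the role of the IPA model's
# output, random logits play the role of the pretraining model's predictions.
rng = np.random.default_rng(0)
num_frames, num_phones = 50, 40
posteriors = rng.dirichlet(np.ones(num_phones), size=num_frames)
labels = frame_pseudo_labels(posteriors)

mask = rng.random(num_frames) < 0.5
mask[0] = True  # guarantee at least one masked frame

logits = rng.normal(size=(num_frames, num_phones))
loss = masked_prediction_loss(logits, labels, mask)
print(f"masked-prediction loss: {loss:.3f}")
```

In this sketch the phonetic pseudo labels simply take the place of the clustering-derived targets that standard HuBERT predicts at masked positions; everything else about the objective stays the same.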
