CL SD AS · Oct 13, 2022

Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models

Haoyu Wang, Wei-Qiang Zhang, Hongbin Suo, Yulong Wan

arXiv:2210.06936v10.31 citations · h-index: 18

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of building speech recognition systems for languages that lack labeled audio data, though it is an incremental extension of existing zero-resource phoneme recognition methods to word-level tasks.

The paper tackles word-level speech recognition for languages without labeled audio data by fine-tuning self-supervised pre-trained models on IPA phoneme transcriptions and decoding with a language model. It achieves an average word error rate of 33.77% across 8 languages, with some languages below 20%.

Labeled audio data is insufficient to build satisfactory speech recognition systems for most of the world's languages. Some zero-resource methods attempt phoneme- or word-level speech recognition without labeled audio data in the target language, but their error rates are usually too high for real-world use. Recently, the representation ability of self-supervised pre-trained models has been found to be extremely beneficial in zero-resource phoneme recognition. To the best of our knowledge, this paper is the first attempt to extend the use of pre-trained models to word-level zero-resource speech recognition. This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts. Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve a word error rate below 20% on some languages, and the average error rate across 8 languages is 33.77%.
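The word error rate quoted above is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that metric, not the authors' evaluation code. The function names are hypothetical, and the unweighted mean across languages is an assumption, since the abstract does not say how the 8-language average is computed.

```python
from __future__ import annotations


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def macro_average(per_language_wer: dict[str, float]) -> float:
    """Unweighted mean of per-language WER values (an assumed averaging scheme)."""
    return sum(per_language_wer.values()) / len(per_language_wer)


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.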
