AS | CL | SD | Aug 10, 2024

Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text

arXiv:2408.05554v1 | 5 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the challenge of improving speech recognition for under-represented languages like Kazakh, offering a potentially generalizable method for low-resource settings.

The authors address Whisper's poor automatic speech recognition performance on Kazakh, a low-resource language, by leveraging unpaired speech and text data. Their modifications include an EOT judgment change and a hallucination penalty, and they report more than 10% absolute WER reduction.

Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost data to improve the performance of Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh. We implemented an end-of-transcript (EOT) judgment modification and a hallucination penalty to improve the performance of speech recognition. Further, we employed the decoding average token log probability as a criterion to select samples from unlabeled speech data and used the pseudo-labeled data to fine-tune the model to further improve its performance. Ultimately, we achieved more than 10% absolute WER reduction in multiple experiments, and the whole process has the potential to be generalized to other under-represented languages.
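The sample-selection step in the abstract (keeping pseudo-labeled utterances by their average token log probability) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hypothesis` class, its field names, the helper functions, and the threshold value of -0.5 are all hypothetical, since the abstract does not state how the criterion is applied or which cutoff is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One decoded utterance. All field names here are hypothetical."""
    utterance_id: str
    text: str
    token_logprobs: List[float]  # log probability of each decoded token


def average_token_logprob(hyp: Hypothesis) -> float:
    """Mean per-token log probability; an empty decode scores -inf."""
    if not hyp.token_logprobs:
        return float("-inf")
    return sum(hyp.token_logprobs) / len(hyp.token_logprobs)


def select_pseudo_labels(hyps: List[Hypothesis], threshold: float) -> List[Hypothesis]:
    """Keep hypotheses whose average token log probability is >= threshold."""
    return [h for h in hyps if average_token_logprob(h) >= threshold]


# Toy data: the threshold of -0.5 is an assumption chosen for illustration only.
hyps = [
    Hypothesis("utt1", "confident transcript", [-0.1, -0.2, -0.3]),
    Hypothesis("utt2", "uncertain transcript", [-1.5, -2.0, -2.5]),
    Hypothesis("utt3", "", []),
]
selected = select_pseudo_labels(hyps, threshold=-0.5)
print([h.utterance_id for h in selected])
```

Averaging over tokens rather than summing keeps the score comparable across utterances of different lengths, which is presumably why a per-token average is a natural confidence criterion here. In practice the retained (audio, transcript) pairs would then be used as fine-tuning data.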

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
