CL | SD | AS | May 27, 2025

Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

arXiv:2505.21138v2 | 5 citations | h-index: 4 | Has Code | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses low-resource speech recognition for Chinese dialects; its contribution is an incremental improvement over existing methods.

The paper tackles speech recognition for Chinese dialects and accents by combining self-supervised pre-training with large language models, and reports state-of-the-art results on multiple dialect datasets, including KeSpeech.

Large-scale training corpora have significantly improved the performance of ASR models. However, because data are relatively scarce, Chinese accents and dialects remain a challenge for most ASR models. Recent advances in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and perform alignment training on a supervised dataset of 40,000 hours. We then systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieves state-of-the-art results on multiple dialect datasets, including KeSpeech. We will open-source our work to promote reproducible research.
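The abstract mentions projectors that connect the speech encoder to the LLM but does not describe their design. As a rough illustration only, the sketch below shows one common style of projector: stack every k consecutive encoder frames to shorten the sequence, then apply a linear map into the LLM embedding size. The function names, the stacking factor, and all dimensions are hypothetical and are not taken from the paper.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames; drop an incomplete tail."""
    usable = len(frames) - len(frames) % k
    return [
        [x for frame in frames[i:i + k] for x in frame]
        for i in range(0, usable, k)
    ]


def linear(vec, weight, bias):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, vec)) + b
        for row, b in zip(weight, bias)
    ]


def project(frames, weight, bias, k):
    """Downsample by frame stacking, then map each vector to LLM_DIM."""
    return [linear(v, weight, bias) for v in stack_frames(frames, k)]


# Hypothetical toy sizes; real encoders and LLMs are far larger.
ENC_DIM, LLM_DIM, K, T = 4, 6, 2, 5

rng = random.Random(0)
# Stand-in for encoder output: T frames of ENC_DIM features each.
frames = [[rng.uniform(-1, 1) for _ in range(ENC_DIM)] for _ in range(T)]
# Randomly initialised projector weights (these would be trained in practice).
weight = [
    [rng.uniform(-0.1, 0.1) for _ in range(ENC_DIM * K)]
    for _ in range(LLM_DIM)
]
bias = [0.0] * LLM_DIM

speech_embeddings = project(frames, weight, bias, K)
print(len(speech_embeddings), len(speech_embeddings[0]))  # 2 6
```

With 5 input frames and a stacking factor of 2, the last frame is dropped and the LLM would receive 2 vectors of size 6. Other projector designs are possible, and this sketch does not indicate which ones the authors compared.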
