CL SD · Sep 22, 2025

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing

Yuhang Dai, Ziyu Zhang, Shuai Wang, Longhao Li, Zhao Guo, Tianlun Zuo, Shuiyuan Wang, Hongfei Xue, Chengyou Wang, Qing Wang, Xin Xu, Hui Bu

arXiv:2509.18004v1 · 13.012 citations · h-index: 10 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a critical gap in dialectal speech processing for Sichuanese speakers, promoting AI equity and mitigating bias. Its contribution is incremental, however, since it centers on data creation rather than novel algorithmic breakthroughs.

The authors tackled the scarcity of large-scale data for Sichuanese dialects by introducing WenetSpeech-Chuan, a 10,000-hour annotated corpus, and showed that models trained on it achieve state-of-the-art performance among open-source systems and results comparable to commercial services.

The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. To address this critical gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel Chuan-Pipeline, a complete data processing framework for dialectal speech. To facilitate rigorous evaluation and demonstrate the corpus's effectiveness, we also release high-quality ASR and TTS benchmarks, WenetSpeech-Chuan-Eval, with manually verified transcriptions. Experiments show that models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and demonstrate results comparable to commercial services. As the largest open-source corpus for Sichuanese dialects, WenetSpeech-Chuan not only lowers the barrier to research in dialectal speech processing but also plays a crucial role in promoting AI equity and mitigating bias in speech technologies. The corpus, benchmarks, models, and recipes are publicly available on our project page.
