CL | Jan 12

Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects

arXiv:2601.07274v1 | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the limited technological support for Chinese dialects, which are spoken by hundreds of millions of people, by providing tools for dialect-to-Mandarin speech-LLMs. The advance is incremental because it builds on existing ASR methods.

The paper tackles the lack of speech and language technologies for Chinese dialects by developing a speech encoder with cross-dialect semantic alignment between dialects and Mandarin, achieving state-of-the-art ASR performance on Chinese dialects and enabling speech-to-speech retrieval on a new benchmark.

Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at https://github.com/kalvinchang/yubao.
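The speech-to-speech retrieval evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity measure or the metric, so the cosine similarity and recall@1 used here are assumptions, as are the function names and the toy three-dimensional vectors, which stand in for utterance-level embeddings from a speech encoder. The idea is that if dialect and Mandarin utterances with the same meaning are semantically aligned, each dialect embedding should retrieve its Mandarin counterpart as the nearest neighbour.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_at_1(
    queries: Sequence[Sequence[float]],
    candidates: Sequence[Sequence[float]],
) -> float:
    """Fraction of queries whose nearest candidate has the same index.

    Assumption: queries[i] and candidates[i] are embeddings of the same
    sentence spoken in two different varieties (e.g. a dialect and Mandarin).
    """
    if not queries:
        raise ValueError("queries must not be empty")
    hits = 0
    for i, query in enumerate(queries):
        # Index of the candidate most similar to this query.
        best = max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))
        hits += int(best == i)
    return hits / len(queries)


# Toy embeddings (hypothetical values, not taken from the paper).
dialect = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
mandarin = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.9]]

score = recall_at_1(dialect, mandarin)
print(f"recall@1 = {score:.2f}")
```

In a real evaluation the vectors would come from pooling the encoder's frame-level outputs per utterance, and the candidate pool would be the full set of utterances in the target variety.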

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
