CLSD
Nov 6, 2025

CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

arXiv:2511.04139v1
3 citations
h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses low-resource Cantonese ASR, which matters for language accessibility. Its contribution is incremental: a hybrid pipeline that pairs an ASR model with a large audio-language model (LALM) for error correction.

The paper tackles low-resource Cantonese automatic speech recognition by introducing CantoASR, a collaborative ASR-LALM error-correction framework that combines forced alignment for acoustic feature extraction with prosody-aware correction. The authors report substantial character error rate (CER) gains over Whisper-Large-V3 on spontaneous Cantonese data.

Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
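The abstract reports results as character error rate (CER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the character-level edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `cer`, the choice to ignore whitespace, and the example sentences are illustrative assumptions; the paper's own scoring script and text normalization are not described in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length.

    Assumption: whitespace is dropped before scoring, a common but not
    universal normalization choice.
    """
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example (not from the paper's data): one substituted
# character in a seven-character reference.
print(round(cer("我哋今日去飲茶", "我地今日去飲茶"), 4))  # 0.1429
```

Because Chinese-script text has no whitespace word boundaries, scoring at the character level avoids depending on a word segmenter, which is why CER rather than word error rate is the usual metric for Cantonese.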
