CL SD AS | May 20, 2025

Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen

CMU

arXiv:2505.14874 | INTERSPEECH | Has Code

Originality: Incremental advance

AI Analysis

This work addresses data scarcity for dysarthric speech recognition in non-English languages. The advance is incremental: it adapts existing voice conversion techniques to a new domain.

The paper tackles automatic speech recognition for dysarthric speech in low-resource languages by using a voice conversion model to generate dysarthric-like speech from healthy data, which improves ASR performance over baseline methods on Spanish, Italian, and Tamil datasets.

Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
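The abstract names speed perturbation as one of the conventional augmentation baselines. As a rough illustration of what that baseline does, the sketch below resamples a waveform by linear interpolation so that it plays faster or slower, which changes duration and pitch together. The function name `speed_perturb` and the factors 0.9 and 1.1 are assumptions chosen for illustration; the abstract does not state the paper's settings or tooling. Tempo perturbation is not shown, since it changes duration while preserving pitch and needs a time-scale modification algorithm.

```python
import numpy as np


def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Resample a mono waveform so it plays `factor` times faster.

    Hypothetical helper for illustration only. factor > 1 shortens the
    signal and raises pitch; factor < 1 lengthens it and lowers pitch.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(waveform)
    n_out = int(round(n_in / factor))
    # Fractional positions in the original signal that each output sample reads.
    src_positions = np.arange(n_out) * factor
    # Linear interpolation; positions past the end reuse the last sample.
    return np.interp(src_positions, np.arange(n_in), waveform)


# Usage: one second of a 220 Hz tone at 16 kHz, slowed down and sped up.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 220.0 * t)

slow = speed_perturb(tone, 0.9)  # assumed factor, not taken from the paper
fast = speed_perturb(tone, 1.1)  # assumed factor, not taken from the paper
print(len(tone), len(slow), len(fast))
```

Linear interpolation keeps the sketch dependency-light; a production pipeline would more likely use a band-limited resampler to avoid aliasing when speeding up.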
