CL, SD, AS · May 20, 2025

Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

CMU
arXiv:2505.14874v5 · 5 citations · h-index: 33 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses data scarcity for dysarthric speech recognition in non-English languages. Its contribution is incremental: it adapts existing voice conversion techniques to a new domain.

The paper tackles automatic speech recognition for dysarthric speech in low-resource languages. It uses a voice conversion model to generate dysarthric-like speech from healthy speech, and fine-tuning on this data improves ASR performance over the off-the-shelf MMS model and over speed and tempo perturbation baselines on Spanish, Italian, and Tamil datasets.
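
The summary does not name the evaluation metric. Assuming results are reported as word error rate (WER), the usual ASR metric, the sketch below shows how it is computed: a word-level Levenshtein alignment counts substitutions, deletions, and insertions, then divides by the reference length. The function name `word_error_rate` is illustrative and not taken from the paper's code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper for illustration; assumes WER is the reported metric.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words (one row of the DP table).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("el gato come pescado", "el gato come")` returns 0.25: one deleted word out of four reference words.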

Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion (VC) model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
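
The two conventional baselines differ in one respect that matters for dysarthric speech simulation: speed perturbation resamples the waveform, so duration and pitch change together (as with the SoX `speed` effect), while tempo perturbation changes duration but keeps pitch (SoX `tempo` uses the WSOLA algorithm). The NumPy sketch below illustrates both. The function names, frame length, tolerance, and factor 1.1 are illustrative assumptions; the paper's exact perturbation settings are not given here, and `tempo_perturb` is a simplified WSOLA, not a reference implementation.

```python
import numpy as np


def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample so the signal plays `factor` times faster.

    Duration scales by 1/factor and pitch by factor (like SoX `speed`).
    """
    n_out = int(round(len(wave) / factor))
    # Fractional read positions in the input for every output sample.
    src_pos = np.arange(n_out) * factor
    return np.interp(src_pos, np.arange(len(wave)), wave)


def tempo_perturb(wave: np.ndarray, factor: float,
                  frame_len: int = 512, tolerance: int = 128) -> np.ndarray:
    """Simplified WSOLA: scale duration by 1/factor while keeping pitch."""
    if len(wave) < frame_len:
        raise ValueError("signal shorter than one frame")
    hop_out = frame_len // 2          # synthesis hop (50% overlap)
    hop_in = hop_out * factor         # nominal analysis hop
    window = np.hanning(frame_len)
    max_start = len(wave) - frame_len
    n_frames = int(max_start / hop_in) + 1
    out_len = (n_frames - 1) * hop_out + frame_len
    out = np.zeros(out_len)
    norm = np.zeros(out_len)

    start = 0  # the first frame is copied from the start of the input
    for k in range(n_frames):
        if k > 0:
            # The previous frame's second half already sits in the overlap
            # region; the new frame's first half should continue it.
            target = wave[start + hop_out:start + frame_len]
            nominal = int(round(k * hop_in))
            lo = max(0, nominal - tolerance)
            hi = min(max_start, nominal + tolerance)
            # Search near the nominal position for the best cross-correlation,
            # so overlapping frames add in phase instead of cancelling.
            scores = [np.dot(wave[c:c + len(target)], target)
                      for c in range(lo, hi + 1)]
            start = lo + int(np.argmax(scores))
        pos = k * hop_out
        out[pos:pos + frame_len] += wave[start:start + frame_len] * window
        norm[pos:pos + frame_len] += window
    # Normalize by the summed window; guard the zero-valued window endpoints.
    return out / np.where(norm > 1e-8, norm, 1.0)
```

For a one-second 200 Hz tone sampled at 16 kHz, `speed_perturb(tone, 1.1)` moves the spectral peak to about 220 Hz, whereas `tempo_perturb(tone, 1.1)` keeps it near 200 Hz; both shorten the signal to roughly 1/1.1 of its length. Plain OLA without the similarity search would let adjacent frames interfere out of phase, which is why the search step is kept even in this reduced form.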

Code Implementations: 1 repo
