AS · AI · LG · SD · Jan 17, 2025

Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR

arXiv:2501.10256v1 · 6 citations · h-index: 6 · 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Originality: Incremental advance
AI Analysis

This work addresses the challenge of making ASR systems more accessible to individuals with dysarthria, though the advance is incremental because it builds on existing conversion techniques.

The paper tackles the problem of poor ASR performance on dysarthric speech by proposing an unsupervised method to convert dysarthric to healthy speech, finding that rhythm conversion improves ASR accuracy for more severe cases.

Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous works have addressed this by speaking rate modification to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which might not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech without further fine-tuning and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria. Code and audio samples are available at https://idiap.github.io/RnV .
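The abstract does not name the metric used to compare ASR performance before and after conversion. Word error rate (WER) is the usual choice for this kind of evaluation, so the following is a minimal sketch under that assumption, not the authors' evaluation code. The function name `word_error_rate` and the example transcripts are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: simple lowercase + whitespace tokenization. Real evaluations
    usually apply more careful text normalization before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts, not taken from the Torgo corpus or the paper.
reference = "the quick brown fox"
asr_on_original = "the quick fox"          # one word dropped
asr_on_converted = "the quick brown fox"   # perfect transcript

print(word_error_rate(reference, asr_on_original))   # 0.25
print(word_error_rate(reference, asr_on_converted))  # 0.0
```

Under this setup, a conversion method helps if the WER of the ASR output on converted speech is lower than the WER on the original dysarthric recording for the same reference transcript.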

