SD · CL · Mar 2

DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement

arXiv:2603.01369v1 · APSIPA
Originality: Incremental advance
AI Analysis

This work targets ASR for dysarthric speakers; it is a domain-specific, incremental advance.

The paper tackles dysarthric speech recognition by proposing DARS, a synthesis framework that models pathological rhythm and acoustic style. Adapting a Whisper-based ASR system with DARS-synthesized speech yields a 54.22% relative reduction in word error rate compared to state-of-the-art methods.

Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mel Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework's effectiveness in enhancing recognition performance.
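The two metrics quoted in the abstract can be made concrete with a minimal sketch. It assumes the conventional definitions: relative WER reduction as 100 × (baseline − new) / baseline, and MCD in dB as (10 / ln 10) · sqrt(2 · Σ (c − c′)²), averaged over time-aligned frames. The function names are hypothetical, and the abstract does not give the paper's exact MCD configuration (number of coefficients, alignment method, whether the 0th coefficient is excluded), so this may differ from the authors' setup.

```python
import math


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over frames that are already time-aligned.

    Each frame is a sequence of mel-cepstral coefficients. By common
    convention the caller drops the 0th (energy) coefficient first;
    that convention is an assumption here, not taken from the paper.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame lists")
    # (10 / ln 10) * sqrt(2) is the constant factor of the standard formula.
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += const * math.sqrt(squared_diff)
    return total / len(ref_frames)
```

With purely illustrative numbers (not results from the paper), `relative_wer_reduction(20.0, 10.0)` gives 50.0: halving the WER is a 50% relative reduction, regardless of the absolute values involved.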

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
