AS | LG | SD | Dec 18, 2024

Investigating the Effects of Diffusion-based Conditional Generative Speech Models Used for Speech Enhancement on Dysarthric Speech

arXiv:2412.13933v1 | 1 citation | h-index: 10 | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Originality: Synthesis-oriented
AI Analysis

This work addresses speech enhancement for dysarthric speech and highlights the limitations of existing models for individuals with Parkinson's disease. It is incremental in that it explores a new application of pre-trained models.

The study investigated how pre-trained diffusion-based conditional generative speech models affect dysarthric speech from speakers with Parkinson's disease recorded in noise-free conditions. It found that these models remove acoustic dysarthric cues during enhancement, which makes them unsuitable for this context. The removed cues can, however, provide complementary information when fused with the original signal.

In this study, we explore, for the first time, the effect of pre-trained conditional generative speech models on dysarthric speech due to Parkinson's disease recorded in an ideal, non-noisy condition. We consider one category of generative models, diffusion-based speech enhancement; these models have previously been trained to learn the distribution of clean (i.e., recorded in a noise-free environment) typical speech signals. We therefore hypothesized that, when exposed to dysarthric speech, they might remove the unseen atypical paralinguistic cues during the enhancement process. Using the automatic dysarthric speech detection task, we experimentally show that when dysarthric speech recorded in an ideal non-noisy environment is enhanced, some of the acoustic dysarthric speech cues are lost. Such pre-trained models are therefore not yet suitable for dysarthric speech enhancement, since they alter the pathological speech cues even when they process clean dysarthric speech. Furthermore, we show that the acoustic cues removed by the enhancement models, in the form of a residue speech signal, can provide complementary dysarthric cues when fused with the original input speech signal in the feature space.
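The residue-and-fusion idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption rather than the authors' method: the residue is taken to be the sample-wise difference between the input and the enhanced output, the feature is a simple frame-level log energy, and fusion is plain concatenation of the two feature streams. The function names (`residue_signal`, `frame_log_energy`, `fuse_features`) are hypothetical, and the enhanced signal is passed in as a plain list instead of coming from a real diffusion model.

```python
import math
from typing import List, Sequence


def residue_signal(original: Sequence[float], enhanced: Sequence[float]) -> List[float]:
    """Assumed definition: what the enhancer removed = input minus enhanced output."""
    if len(original) != len(enhanced):
        raise ValueError("original and enhanced signals must have the same length")
    return [o - e for o, e in zip(original, enhanced)]


def frame_log_energy(signal: Sequence[float], frame_len: int = 4, hop: int = 2) -> List[float]:
    """Toy frame-level feature (log mean energy); a stand-in for real acoustic features."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return feats


def fuse_features(original: Sequence[float], enhanced: Sequence[float],
                  frame_len: int = 4, hop: int = 2) -> List[List[float]]:
    """Assumed fusion: concatenate per-frame features of the input and of the residue."""
    residue = residue_signal(original, enhanced)
    f_orig = frame_log_energy(original, frame_len, hop)
    f_res = frame_log_energy(residue, frame_len, hop)
    return [[a, b] for a, b in zip(f_orig, f_res)]


# Toy example: a six-sample "recording" and a hypothetical enhanced version of it.
original = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
enhanced = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
fused = fuse_features(original, enhanced)
```

Each row of `fused` holds one frame's feature from the original signal followed by the same feature from the residue, which is the form a downstream dysarthria detector would consume in this sketch.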
