A Multi-modal Approach to Dysarthria Detection and Severity Assessment Using Speech and Text Information
The work addresses the problem of improving diagnostic tools for patients with dysarthria, though the contribution appears incremental: it extends existing single-modality methods to multi-modal fusion.
The paper tackles dysarthria detection and severity assessment with a multi-modal approach that combines speech and text information through cross-attention, reporting accuracies of up to 99.53% for detection and 98.12% for severity assessment in the speaker-dependent setting.
Automatic detection and severity assessment of dysarthria are crucial for delivering targeted therapeutic interventions to patients. While most existing research focuses on the speech modality alone, this study introduces an approach that leverages both speech and text modalities. By employing a cross-attention mechanism, our method learns the acoustic and linguistic similarities between speech and text representations. The approach specifically assesses pronunciation deviations across different severity levels, thereby improving the accuracy of dysarthria detection and severity assessment. All experiments were performed on the UA-Speech dysarthric database. In the speaker-dependent and speaker-independent settings, respectively, the method achieves improved accuracies of 99.53% and 93.20% for detection and 98.12% and 51.97% for severity assessment, under unseen and seen words settings. These findings suggest that integrating text information, which provides reference linguistic knowledge, yields a more robust framework for dysarthria detection and assessment, potentially leading to more effective diagnoses.
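To make the cross-attention idea concrete, the following is a minimal sketch of scaled dot-product cross-attention between a speech sequence and a text sequence. It is not the authors' implementation: the function names, the toy dimensions, the omission of learned query/key/value projections, and the choice of speech frames as queries with text tokens as keys and values are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(speech, text):
    """Scaled dot-product cross-attention without learned projections.

    speech: list of frame vectors used as queries (assumed direction).
    text:   list of token vectors used as both keys and values.
    Returns (attended, weights): attended holds one vector per speech
    frame; weights[i][j] is how strongly frame i attends to token j.
    """
    d = len(speech[0])
    scale = math.sqrt(d)
    attended, weights = [], []
    for query in speech:
        # Similarity of this speech frame to every text token.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in text]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of text vectors: the text context for this frame.
        attended.append(
            [sum(w * value[c] for w, value in zip(row, text)) for c in range(d)]
        )
    return attended, weights


# Toy inputs standing in for encoder outputs (hypothetical shapes:
# 5 speech frames and 3 text tokens, each of dimension 8).
random.seed(0)
speech_frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
text_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
fused, attn = cross_attention(speech_frames, text_tokens)
```

Each row of `attn` is a probability distribution over text tokens, so a frame whose acoustics deviate from the expected pronunciation would, under this sketch, produce a different attention pattern than a typical frame. A full model would presumably add learned projections and a classifier on top of `fused`.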