ASAI May 22, 2025

Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection

arXiv:2505.16351v2 | 11 citations | h-index: 97 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work targets efficient transcription of disordered speech, a known bottleneck for speech-language pathologists in diagnostics and treatment planning, and proposes a new decoding method for it.

The paper tackles automatic speech dysfluency transcription and detection. It introduces Dysfluent-WFST, a zero-shot decoder that transcribes phonemes and detects dysfluency simultaneously, with no additional training. The authors report state-of-the-art phonetic error rate and dysfluency detection on simulated and real speech data.

Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.
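The abstract reports results in phonetic error rate without defining it. As a minimal sketch, assuming the usual definition (Levenshtein edit distance between the reference and decoded phoneme sequences, divided by the reference length), the metric could be computed as below. The function name and the example phoneme sequences are hypothetical, and the paper's exact scoring protocol may differ.

```python
def phonetic_error_rate(reference, hypothesis):
    """Edit distance between two phoneme sequences, divided by reference length.

    Assumption: standard Levenshtein distance with unit cost for each
    substitution, insertion, and deletion. This is not taken from the paper.
    """
    if not reference:
        raise ValueError("reference must contain at least one phoneme")

    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j phonemes of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]  # deleting i reference phonemes to match an empty hypothesis
        for j, hyp_ph in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_ph == hyp_ph else 1)
            deletion = prev[j] + 1       # reference phoneme missing in hypothesis
            insertion = curr[j - 1] + 1  # extra phoneme in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical example: the word "please" decoded with a repeated first phoneme.
reference = ["P", "L", "IY", "Z"]
hypothesis = ["P", "P", "L", "IY", "Z"]
per = phonetic_error_rate(reference, hypothesis)
print(f"PER = {per:.2f}")
```

In this toy case the repeated "P" counts as a single insertion against four reference phonemes. A dysfluency-aware decoder would additionally label that extra phoneme as a repetition, which a plain error rate does not capture.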

Code Implementations: 1 repo
