CL | AI | AS | Sep 24, 2025

DRES: Benchmarking LLMs for Disfluency Removal

Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang

arXiv:2509.20321v1 | h-index: 10 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses disfluency removal for speech-driven systems such as conversational agents. Its contribution is incremental: it focuses on benchmarking and evaluation rather than proposing a new method.

The authors address the degradation that disfluencies cause in speech-driven systems by introducing DRES, a benchmark for evaluating LLMs on disfluency removal. They find that simple segmentation improves performance, and that fine-tuning achieves near state-of-the-art precision and recall but harms generalization.

Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
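The abstract reports precision and recall for disfluency removal but this listing does not give the exact metric definition. As a rough illustration only, the sketch below scores a model output at the word level, under the assumption that both the gold and the predicted fluent text are obtained purely by deleting tokens from the disfluent input. The function names, the right-to-left alignment rule, and the example utterance are all hypothetical and are not taken from the paper.

```python
def deleted_indices(source, kept):
    """Return the indices of `source` tokens that are missing from `kept`.

    Assumption: `kept` is a subsequence of `source` (deletion-only editing).
    Tokens are aligned greedily from right to left, so that when a word is
    repeated the earlier copy is treated as the deleted one.
    """
    deleted = set()
    j = len(kept) - 1
    for i in range(len(source) - 1, -1, -1):
        if j >= 0 and source[i] == kept[j]:
            j -= 1          # this source token survives in the output
        else:
            deleted.add(i)  # this source token was removed
    if j >= 0:
        raise ValueError("output is not a subsequence of the input")
    return deleted


def deletion_scores(source, gold, predicted):
    """Word-level precision, recall, and F1 over deleted token positions."""
    gold_del = deleted_indices(source, gold)
    pred_del = deleted_indices(source, predicted)
    true_pos = len(gold_del & pred_del)
    # Convention chosen for this sketch: an empty set scores 1.0.
    precision = true_pos / len(pred_del) if pred_del else 1.0
    recall = true_pos / len(gold_del) if gold_del else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the model removes the fillers but keeps the
# abandoned phrase "to boston" (under-deletion).
source = "um i want a flight to boston uh i mean to denver".split()
gold = "i want a flight to denver".split()
predicted = "i want a flight to boston to denver".split()

precision, recall, f1 = deletion_scores(source, gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example every deletion the model made was correct, so precision is high, while recall drops because part of the edited statement survives. Over-deletion of fluent tokens, which the abstract attributes to reasoning-oriented models, would show up as the opposite pattern: lower precision with high recall.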
