CL AI AS

Sep 24, 2025

Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang

arXiv:2509.20319v1

1 citation

h-index: 10

Originality: Incremental advance

AI Analysis

Z-Scores give speech-processing researchers a diagnostic tool for identifying model failure modes and designing targeted interventions. The contribution is incremental, since it builds on existing evaluation methods.

The paper addresses the evaluation of disfluency removal in speech by introducing Z-Scores, a span-level, linguistically grounded metric that categorizes system behavior across distinct disfluency types. The metric reveals systematic weaknesses that traditional word-level metrics obscure, and a case study shows that it uncovers hidden challenges for LLMs.

Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
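The contrast between aggregate word-level scores and category-specific diagnostics can be illustrated with a minimal sketch. This is not the paper's Z-Score definition or its alignment module, neither of which is specified in the abstract. It assumes a hypothetical input in which every token of the disfluent transcript already carries a category label (EDITED, INTJ, PRN, or None for fluent words) and a flag saying whether the system removed it. The function names `e_scores` and `per_category_removal` are illustrative only, and the breakdown is computed per token rather than per span for simplicity.

```python
from collections import defaultdict

def e_scores(labels, removed):
    """Word-level precision, recall, and F1 for removal decisions.

    A token counts as a positive if it carries any disfluency label.
    """
    pairs = list(zip(labels, removed))
    tp = sum(1 for lab, rem in pairs if lab is not None and rem)
    fp = sum(1 for lab, rem in pairs if lab is None and rem)
    fn = sum(1 for lab, rem in pairs if lab is not None and not rem)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def per_category_removal(labels, removed):
    """Fraction of disfluent tokens removed, broken down by category.

    Assumption: a token-level stand-in for a category-specific diagnostic.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for lab, rem in zip(labels, removed):
        if lab is None:
            continue  # fluent tokens do not belong to any category
        totals[lab] += 1
        hits[lab] += int(rem)
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Hypothetical example: the system removes the EDITED tokens but leaves
# the interjection ("uh") and the parenthetical ("you know") in place.
tokens  = ["i", "want", "i", "need", "uh", "a", "flight", "you", "know"]
labels  = ["EDITED", "EDITED", None, None, "INTJ", None, None, "PRN", "PRN"]
removed = [True, True, False, False, False, False, False, False, False]

precision, recall, f1 = e_scores(labels, removed)
by_category = per_category_removal(labels, removed)
print(precision, recall, f1)
print(by_category)
```

In this toy input the single aggregate recall figure mixes a fully handled category (EDITED) with two untouched ones (INTJ, PRN). The per-category breakdown separates them, which is the kind of diagnostic the abstract attributes to Z-Scores.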
