Z-Scores: A Metric for Linguistically Assessing Disfluency Removal
Z-Scores give speech-processing researchers a diagnostic tool for identifying model failure modes and designing targeted interventions, though the contribution is incremental because it builds on existing evaluation methods.
The paper tackles the problem of evaluating disfluency removal in speech by introducing Z-Scores, a span-level, linguistically grounded metric that categorizes system behavior across distinct disfluency types. The metric reveals systematic weaknesses that traditional word-level metrics obscure, and a case study shows that it uncovers hidden challenges in LLMs.
Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
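The abstract gives neither the Z-Score formula nor the alignment algorithm, so the following is only an illustrative sketch of the contrast it describes, not the paper's implementation. It assumes that a category score can be approximated by the fraction of annotated spans of that type that the system removes in full, and it uses Python's `difflib.SequenceMatcher` as a stand-in for the deterministic alignment module. The function names (`kept_mask`, `token_tags`, `word_level_scores`, `category_removal_rates`) and the toy sentence with its annotations are hypothetical.

```python
from difflib import SequenceMatcher

CATEGORIES = ("EDITED", "INTJ", "PRN")


def kept_mask(source_tokens, output_tokens):
    """Mark which source tokens survive in the system output.

    Stand-in alignment (an assumption): tokens inside difflib's matching
    blocks count as kept, all other source tokens count as removed.
    """
    matcher = SequenceMatcher(a=source_tokens, b=output_tokens, autojunk=False)
    kept = [False] * len(source_tokens)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            kept[i] = True
    return kept


def token_tags(n_tokens, spans):
    """Expand (start, end, category) spans, end exclusive, to per-token tags."""
    tags = [None] * n_tokens
    for start, end, category in spans:
        for i in range(start, end):
            tags[i] = category
    return tags


def word_level_scores(tags, kept):
    """Token-level precision, recall, F1 with 'removed' as the positive class."""
    tp = sum(1 for t, k in zip(tags, kept) if t is not None and not k)
    fp = sum(1 for t, k in zip(tags, kept) if t is None and not k)
    fn = sum(1 for t, k in zip(tags, kept) if t is not None and k)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def category_removal_rates(spans, kept):
    """Per-category share of spans removed in full (assumed Z-Score proxy)."""
    totals = {c: 0 for c in CATEGORIES}
    removed = {c: 0 for c in CATEGORIES}
    for start, end, category in spans:
        totals[category] += 1
        if not any(kept[start:end]):
            removed[category] += 1
    return {c: (removed[c] / totals[c] if totals[c] else None) for c in CATEGORIES}


# Toy example (invented for illustration): the system drops the repair
# "i want" and the filler "uh" but leaves the parenthetical "you know".
source = "i want uh i need a flight you know to boston".split()
output = "i need a flight you know to boston".split()
spans = [(0, 2, "EDITED"), (2, 3, "INTJ"), (7, 9, "PRN")]

kept = kept_mask(source, output)
precision, recall, f1 = word_level_scores(token_tags(len(source), spans), kept)
rates = category_removal_rates(spans, kept)

print(f"word-level P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print("per-category removal rate:", rates)
```

On this toy input the word-level F1 is about 0.75, which looks acceptable on its own. The per-category breakdown shows that every EDITED and INTJ span was removed while no PRN span was, which is the kind of failure mode the abstract says aggregate scores hide.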