CL · AS · May 25

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

arXiv:2605.25404 · 84.6
Predicted impact: top 35% in CL · last 90 days · Originality: Highly original
AI Analysis

For developers of industrial spoken dialogue systems, this work provides a practical method to mitigate error propagation in cascaded ASR-LLM pipelines, improving robustness without sacrificing perceptual verifiability.

The paper proposes a cause-aware error recovery paradigm for cascaded ASR-LLM spoken dialogue systems that disentangles token-level errors into perception, comprehension, and deletion failures using precision-focused detectors. This approach more than doubles recall on domain-shift errors (57.96% vs. 23.66%), reduces WER by up to 30%, and improves downstream task performance by 17% across diverse conditions.
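To make the WER figure concrete, the following is a minimal sketch of how word error rate and a relative reduction are commonly computed. It assumes the reported 30% is a relative reduction, which the summary does not state; the function names and the example sentences are illustrative and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Illustrative sentences (not from the paper): one deleted word ("a")
# and one substituted word ("seven" heard as "eleven").
wer = word_error_rate("book a table for two at seven",
                      "book table for two at eleven")
print(f"WER = {wer:.4f}")
```

Under this reading, a system whose WER falls from 0.20 to 0.14 has achieved a 30% relative reduction. Note that the deletion in the example leaves no token behind to attach a confidence score to, which is the limitation of confidence filtering that the paper highlights.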

Cascaded Automatic Speech Recognition–Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.
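The idea of routing each diagnosed error to a cause-specific clarification can be sketched as follows. This is a hypothetical illustration only: the abstract does not describe the detectors' outputs or the prompts used, and in the paper the LLM orchestrates the clarification, whereas this sketch substitutes fixed templates. All names (`ErrorCause`, `TokenDiagnosis`, `build_clarification`) and the wording of the prompts are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ErrorCause(Enum):
    PERCEPTION = "perception"        # acoustic: the word was not heard clearly
    COMPREHENSION = "comprehension"  # linguistic: heard, but not understood
    DELETION = "deletion"            # words are missing from the transcript


@dataclass
class TokenDiagnosis:
    token: str                    # for a deletion, the token before the suspected gap
    cause: Optional[ErrorCause]   # None means no detector flagged this token


def build_clarification(diagnoses: List[TokenDiagnosis]) -> Optional[str]:
    """Return a clarification question for the first flagged token, if any.

    Hypothetical template-based stand-in for LLM-driven clarification.
    """
    for d in diagnoses:
        if d.cause is ErrorCause.PERCEPTION:
            return f"Sorry, I didn't catch the word near '{d.token}'. Could you repeat it?"
        if d.cause is ErrorCause.COMPREHENSION:
            return f"I heard '{d.token}' but I'm not sure what it refers to. Could you rephrase it?"
        if d.cause is ErrorCause.DELETION:
            return f"I think I missed something after '{d.token}'. What did you say there?"
    return None  # nothing flagged: pass the transcript on unchanged


# Illustrative transcript with one token flagged as a comprehension failure.
transcript = [
    TokenDiagnosis("refill", None),
    TokenDiagnosis("my", None),
    TokenDiagnosis("metformin", ErrorCause.COMPREHENSION),
]
print(build_clarification(transcript))
```

Separating the cause from the token is what allows a different recovery action per failure type; a single confidence threshold would collapse all three cases into one generic "please repeat" request.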
