Beyond WER: A Paired Acoustic Stress Test for Ambient Clinical Scribes
For developers of clinical AI systems, this paper exposes a critical safety gap in standard evaluation metrics and offers a practical fix.
The paper shows that stationary ambient noise can nearly double the rate of unsafe outputs in ambient clinical scribes while raising Word Error Rate by only 0.71 percentage points, and it proposes a lightweight mitigation strategy.
Ambient clinical scribes increasingly combine Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to automate documentation. However, traditional metrics such as Word Error Rate (WER) can mask systemic safety degradation. We present a paired acoustic stress test to isolate the causal impact of noise on clinical reasoning. For the same dialogues, we inject diverse noise types while keeping the downstream model configuration frozen, so that any change in output is attributable to the acoustic condition alone. Crucially, we uncover a dangerous disconnect between signal fidelity and clinical safety. Stationary ambient noise increased WER by a negligible 0.71 percentage points yet nearly doubled the rate of unsafe outputs. Our analysis reveals that minor acoustic perturbations can invert clinical meaning without substantially inflating error rates. Furthermore, we demonstrate a lightweight mitigation strategy that reduces safety degradation under noisy conditions without requiring model fine-tuning.
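The paired comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `PairedRecord` structure, the `summarize` helper, the safety labels, and the two toy dialogues are all hypothetical, chosen only to show how a single substituted word can flip clinical meaning while corpus-level WER stays small. The sketch assumes each dialogue has a reference transcript, one ASR transcript per acoustic condition, and a binary "unsafe" label on the downstream output for each condition.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


def word_edits(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # match or substitution
        prev = curr
    return prev[-1]


@dataclass
class PairedRecord:
    """One dialogue evaluated under both conditions (hypothetical structure)."""
    reference: str          # ground-truth transcript
    clean_transcript: str   # ASR output on the clean audio
    noisy_transcript: str   # ASR output on the same audio with injected noise
    clean_unsafe: bool      # assumed safety label for the clean-condition output
    noisy_unsafe: bool      # assumed safety label for the noisy-condition output


def summarize(records: List[PairedRecord]) -> Dict[str, float]:
    """Corpus-level WER and unsafe-output rate per condition, plus paired flips."""
    ref_words = sum(len(r.reference.split()) for r in records)
    clean_edits = sum(word_edits(r.reference.split(), r.clean_transcript.split())
                      for r in records)
    noisy_edits = sum(word_edits(r.reference.split(), r.noisy_transcript.split())
                      for r in records)
    n = len(records)
    return {
        "wer_clean": clean_edits / ref_words,
        "wer_noisy": noisy_edits / ref_words,
        "unsafe_clean": sum(r.clean_unsafe for r in records) / n,
        "unsafe_noisy": sum(r.noisy_unsafe for r in records) / n,
        # Dialogues that were safe when clean but unsafe under noise.
        "newly_unsafe": sum(r.noisy_unsafe and not r.clean_unsafe for r in records),
    }


# Toy data for illustration only; these are not results from the paper.
records = [
    PairedRecord("patient denies chest pain",
                 "patient denies chest pain",
                 "patient has chest pain",      # one substitution inverts the meaning
                 clean_unsafe=False, noisy_unsafe=True),
    PairedRecord("continue metformin twice daily",
                 "continue metformin twice daily",
                 "continue metformin twice daily",
                 clean_unsafe=False, noisy_unsafe=False),
]
summary = summarize(records)
print(summary)
```

Because every record carries both conditions for the same dialogue, the `newly_unsafe` count is a within-pair quantity: it reflects outputs that changed under noise rather than differences between two separate samples of dialogues.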