LG · Jun 19, 2025

Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches

arXiv:2506.16528v1 · 8 citations · h-index: 44 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This paper addresses the need for better ASR evaluation for users with dysarthric and dysphonic speech, but the advance is incremental because it builds on existing NLI and similarity methods.

The paper tackles the problem that traditional ASR metrics such as WER and CER fail to capture intelligibility for dysarthric and dysphonic speech. It proposes a metric that integrates NLI scores, semantic similarity, and phonetic similarity, and reports a 0.890 correlation with human judgments on Speech Accessibility Project data.

Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address this, we propose a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our ASR evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and emphasizing the need to prioritize intelligibility over error-based measures.
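The abstract does not say how the three signals are combined, so the following is only a minimal sketch under stated assumptions. It contrasts word error rate with a hypothetical weighted average of precomputed NLI, semantic, and phonetic scores. The function names, the [0, 1] score range, and the equal default weights are illustrative assumptions, not the paper's formulation.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: empty reference gives 0.0 or 1.0.
        return 1.0 if hyp else 0.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def combined_intelligibility(
    nli: float,
    semantic: float,
    phonetic: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Hypothetical weighted average of three scores, each assumed in [0, 1].

    The weighting scheme is an assumption made for illustration; the source
    only states that the metric integrates these three signals.
    """
    scores = (nli, semantic, phonetic)
    if len(weights) != len(scores):
        raise ValueError("expected exactly three weights")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


if __name__ == "__main__":
    # A repeated word costs one insertion under WER even though the
    # meaning of the transcript is unchanged.
    print(word_error_rate("the cat sat", "the the cat sat"))
    # The three input scores here are made-up placeholder values.
    print(combined_intelligibility(0.9, 0.8, 0.7, weights=(2.0, 1.0, 1.0)))
```

In practice the three inputs would come from an NLI model, a sentence-embedding similarity, and a phoneme-level comparison. The sketch takes them as plain numbers so that it stays independent of any particular model or library.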

