Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
This work addresses how researchers and practitioners can accurately assess conversational ASR systems; the contribution is incremental, since it builds on existing metrics and benchmarks.
The paper tackles the challenge of evaluating spoken language models for conversational automatic speech recognition in multi-speaker settings. It introduces tcpSemER, a semantic similarity metric, and decomposes tcpWER into overlapping and non-overlapping components for overlap analysis. It finds that LLM-based systems degrade as speaker count and overlap increase, while modular pipelines remain more robust.
Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.
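As a rough illustration of the idea behind tcpSemER, the sketch below contrasts a plain word-level error rate (Levenshtein distance over words) with an embedding-based error defined as one minus cosine similarity. This is not the paper's implementation: the function names (`word_error_rate`, `toy_embed`, `semantic_error`) are hypothetical, the bag-of-words `toy_embed` is only a dependency-free stand-in for a real sentence-embedding model, and the time-constrained, speaker-permutation matching that tcpWER performs is omitted entirely.

```python
import math
from collections import Counter


def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances from the empty reference prefix
    for i, rw in enumerate(r, 1):
        cur = [i]  # deleting i reference words to reach an empty hypothesis
        for j, hw in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (rw != hw),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(r), 1)


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: bag-of-words counts standing in for a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_error(ref: str, hyp: str, embed=toy_embed) -> float:
    """Hypothetical semantic error: 1 - cosine similarity of the embeddings."""
    return 1.0 - cosine(embed(ref), embed(hyp))
```

With the bag-of-words stand-in, the semantic error still tracks word overlap. Passing a real sentence encoder as `embed` (with a matching similarity function) is what would let a meaning-preserving paraphrase score lower than a meaning-altering substitution, even when both cost the same number of word edits.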