Semantic-WER: A Unified Metric for the Evaluation of ASR Transcript for End Usability
The work addresses the need for better evaluation metrics for ASR systems so that transcripts are more usable in downstream tasks, though the contribution appears incremental because it builds on existing critiques of WER.
The paper tackles the problem that word error rate (WER) is unsuitable for evaluating ASR transcripts in downstream tasks like SLU and information retrieval, and proposes Semantic-WER (SWER) as a unified metric that can be customized for such applications.
Recent advances in supervised, semi-supervised and self-supervised deep learning algorithms have significantly improved the performance of automatic speech recognition (ASR) systems. State-of-the-art systems have achieved a word error rate (WER) of less than 5%. However, researchers have previously argued that the WER metric is unsuitable for evaluating ASR systems for downstream tasks such as spoken language understanding (SLU) and information retrieval. The reason is that WER works at the surface level and does not include any syntactic or semantic knowledge. The current work proposes Semantic-WER (SWER), a metric to evaluate ASR transcripts for downstream applications in general. SWER can be easily customized for any downstream task.
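To make the surface-level criticism concrete, the following minimal sketch computes the standard WER (word-level edit distance divided by reference length) for two hypothetical transcripts. It does not implement SWER, whose definition is not given in this abstract. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical examples (not from the paper): each hypothesis has one substitution.
reference = "send fifty dollars to john"
minor_error = "send fifty dollars too john"  # function-word confusion
major_error = "send fifty dollars to joan"   # wrong recipient name

wer_minor = word_error_rate(reference, minor_error)
wer_major = word_error_rate(reference, major_error)
print(wer_minor, wer_major)
```

Both hypotheses receive the same WER because each contains exactly one substituted word out of five, even though the second error changes the recipient and would matter far more to an SLU or retrieval system. This is the kind of gap that a semantically informed metric is meant to close.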