CL · LG · SD · AS
Oct 7, 2020

WER we are and WER we think we are

arXiv:2010.03432v11005 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of overoptimistic ASR evaluations for researchers and practitioners, highlighting issues with current benchmarks and proposing guidelines for more realistic datasets.

The authors challenge recent claims of low Word Error Rates (WERs) in ASR systems by testing three commercial systems on real-life conversations and a public benchmark, showing significantly higher WERs than reported.

Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and on the HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
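
For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words: (substitutions + deletions + insertions) / N. The sketch below illustrates that standard definition only. It is not the authors' evaluation pipeline; the function name `word_error_rate` is hypothetical, and whitespace tokenization with no text normalization (casing, punctuation, numerals) is a simplifying assumption that real evaluations rarely make.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly,
    with no normalization of case or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over 4 reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason scores are only comparable when the reference transcripts and normalization rules are held fixed.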
