CL · SD · AS · Apr 25, 2024

Automatic Speech Recognition System-Independent Word Error Rate Estimation

arXiv:2404.16743v283 citations · h-index: 6 · LREC
Originality: Incremental advance
AI Analysis

The paper addresses the inflexibility of ASR system-dependent, and therefore domain-dependent, WER estimators in real-world ASR applications. The advance is incremental: it extends prior WER estimation work to system-independent scenarios.

The paper tackles word error rate (WER) estimation for automatic speech recognition without tying the estimator to a specific ASR system. It proposes a system-independent method that trains the estimator on simulated ASR output. On out-of-domain data (Switchboard and CALLHOME), it achieves state-of-the-art performance, improving on the baseline estimators by a relative 17.58% in root mean square error and 18.21% in Pearson correlation coefficient.
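The two evaluation metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and the definition of "relative improvement" (change divided by the baseline value, with the sign flipped for lower-is-better metrics) is an assumption about how the reported percentages were derived.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between predicted and reference WER values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


def pearson(predicted, reference):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


def relative_improvement(baseline, proposed, lower_is_better):
    """Improvement as a fraction of the baseline value (assumed definition)."""
    delta = baseline - proposed if lower_is_better else proposed - baseline
    return delta / baseline
```

RMSE is a lower-is-better metric and Pearson correlation is higher-is-better, so an improvement in the first is a reduction while an improvement in the second is an increase; the `lower_is_better` flag keeps both reported as positive numbers.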

Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches a similar performance to ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by relative 17.58% and 18.21%, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
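To make the quantities in the abstract concrete, the sketch below computes WER as word-level edit distance divided by reference length, and simulates an ASR hypothesis by swapping words for similar-sounding alternatives. It is an illustration under stated assumptions, not the authors' method: the `CONFUSIONS` table and the `simulate_hypothesis` function are hypothetical, the toy simulator only performs substitutions, and the paper's actual selection of phonetically similar or linguistically more likely words is not reproduced here.

```python
import random


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical confusion table (NOT from the paper): each word maps to
# similar-sounding alternatives that a recognizer might plausibly output.
CONFUSIONS = {
    "their": ["there"],
    "two": ["too", "to"],
    "see": ["sea"],
}


def simulate_hypothesis(reference, error_prob, rng):
    """Replace confusable words with probability error_prob (toy simulator)."""
    words = []
    for word in reference.split():
        if word in CONFUSIONS and rng.random() < error_prob:
            words.append(rng.choice(CONFUSIONS[word]))
        else:
            words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    ref = "i see two cats over their fence"
    hyp = simulate_hypothesis(ref, error_prob=0.5, rng=rng)
    print(hyp, wer(ref, hyp))
```

Pairs of simulated hypothesis and reference, labeled with the WER computed this way, are the kind of data a system-independent estimator could be trained on. Raising or lowering `error_prob` shifts the WER of the simulated training set, which relates to the abstract's observation that performance improves when training-set WER is close to that of the evaluation data.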

