CL · Apr 23, 2018

ASR Performance Prediction on Unseen Broadcast Programs using Convolutional Neural Networks

arXiv:1804.08477v1 · 10 citations
Originality: Incremental advance
AI Analysis

This work addresses the relatively new task of ASR performance prediction for broadcast programs. The advance is incremental: it builds on existing methods, with a novel combination approach.

The paper tackles the problem of predicting ASR performance on unseen broadcast programs. It compares a regression baseline using engineered features with a new CNN-based approach using learnt features, and finds that the CNN with combined textual and signal inputs achieves the best WER prediction performance and accurately predicts WER distributions.

In this paper, we address a relatively new task: prediction of ASR performance on unseen broadcast programs. We first propose a heterogeneous French corpus dedicated to this task. Two prediction approaches are compared: a state-of-the-art performance prediction approach based on regression (engineered features) and a new strategy based on convolutional neural networks (learnt features). We particularly focus on the combination of both textual (ASR transcription) and signal inputs. While the joint use of textual and signal features did not work for the regression baseline, the combination of inputs for CNNs leads to the best WER prediction performance. We also show that our CNN predicts the WER distribution on a collection of speech recordings remarkably well.
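
For readers unfamiliar with the quantity being predicted, the sketch below computes the word error rate (WER) of a transcript against a reference as word-level edit distance divided by reference length, and then scores a set of WER predictions with mean absolute error. The function names are hypothetical, and the choice of mean absolute error is an assumption for illustration: the abstract does not state which evaluation metric the authors use, and this is not their implementation.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute gap between predicted and measured WER values.

    Assumed metric for illustration; the abstract does not name one.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

A performance predictor in this setting would output a WER estimate per recording without access to the reference transcript; `mean_absolute_error` then compares those estimates with WER values measured as above.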

