SD CL AS Aug 3, 2021

The Performance Evaluation of Attention-Based Neural ASR under Mixed Speech Input

arXiv:2108.01245v14.31 citations Has Code

Originality Synthesis-oriented

AI Analysis

The paper addresses ASR robustness in noisy, multi-speaker environments, but the contribution is incremental: it applies an existing method to new data.

The paper evaluated the performance of an attention-based neural ASR model (LAS) under mixed speech conditions, finding a 65% relative increase in phoneme error rate at 0 dB target-to-interference ratio, with performance improving as the ratio increased to 30 dB.

To evaluate attention-based neural ASR under noisy conditions, the current trend is to present hours of varied noisy speech data to the model and measure the overall word/phoneme error rate (W/PER). In general, it is unclear how these models perform in a cocktail party setup in which two or more speakers are active. In this paper, we present mixtures of speech signals to a popular attention-based neural ASR, known as Listen, Attend, and Spell (LAS), at different target-to-interference ratios (TIR) and measure the phoneme error rate. In particular, we investigate in detail which phoneme is predicted when two phonemes are mixed; in this fashion we build a model that gives the most probable predictions for a phoneme. We found a 65% relative increase in PER when LAS was presented with mixed speech signals at TIR = 0 dB, and the performance approaches the unmixed scenario at TIR = 30 dB. Our results show that the model, when presented with mixed phoneme signals, tends to predict the phonemes that had higher accuracies during evaluation on the original phoneme signals.
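The evaluation described above can be illustrated with a minimal Python sketch. The abstract does not say how the mixtures were generated or scored, so everything here is an assumption: TIR is taken as the ratio of average signal powers in dB, PER as phoneme-level edit distance divided by reference length, and the function names (mix_at_tir, phoneme_error_rate, relative_increase) are hypothetical and not taken from the paper's code.

```python
import numpy as np


def mix_at_tir(target, interferer, tir_db):
    """Add an interfering signal to a target at a given TIR (in dB).

    Assumption: TIR = 10 * log10(P_target / P_interferer), with P the mean
    squared amplitude. Both signals are truncated to the shorter length.
    """
    target = np.asarray(target, dtype=np.float64)
    interferer = np.asarray(interferer, dtype=np.float64)
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]

    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    # Gain that brings the interferer power to p_target / 10^(TIR/10).
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (tir_db / 10.0)))
    return target + gain * interferer


def phoneme_error_rate(ref, hyp):
    """Levenshtein distance between phoneme sequences, divided by len(ref).

    Assumes a non-empty reference sequence.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_increase(per_mixed, per_clean):
    """Relative PER increase of the mixed condition over the clean one."""
    return (per_mixed - per_clean) / per_clean
```

Under these assumptions, TIR = 0 dB means the interfering speaker has the same average power as the target, while at TIR = 30 dB the interferer carries one thousandth of the target's power, which is consistent with performance approaching the unmixed scenario.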
