CL · AS · ML · Dec 5, 2017

Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models

arXiv:1712.01818v1 · 173 citations
Originality: Incremental advance
AI Analysis

This work addresses the mismatch between the proxy losses used to train speech recognition models and the metrics used to evaluate them. The advance is incremental, but it has practical impact on accuracy in tasks such as mobile voice search.

The authors tackle the mismatch between cross-entropy training and word error rate (WER) evaluation in attention-based sequence-to-sequence models for automatic speech recognition. They propose a training method that directly minimizes expected WER, which improves performance by up to 8.2% relative to the baseline.

Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task.
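The N-best approximation described in the abstract can be illustrated with a minimal sketch. It assumes the model's log-probabilities for each hypothesis are already available, renormalizes them over the N-best list, and weights each hypothesis's word-error count by that probability. Subtracting the mean error count over the list is a common variance-reduction choice and is an assumption here, since the abstract does not spell it out. The function names `word_errors` and `nbest_expected_word_errors`, the example hypotheses, and the scores are all hypothetical. The sketch computes only the loss value; in real training the probabilities would come from a differentiable model.

```python
import math


def word_errors(hyp, ref):
    """Word-level edit distance between two whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))  # distances for an empty hypothesis
    for i, hw in enumerate(h, start=1):
        cur = [i]
        for j, rw in enumerate(r, start=1):
            cur.append(min(
                prev[j] + 1,               # extra word in the hypothesis
                cur[j - 1] + 1,            # reference word missing from the hypothesis
                prev[j - 1] + (hw != rw),  # substitution, or a match at no cost
            ))
        prev = cur
    return prev[-1]


def nbest_expected_word_errors(hyps, log_probs, ref):
    """Approximate expected word errors over an N-best list.

    Probabilities are renormalized over the list (a softmax of the
    log-probabilities), and the mean error count is subtracted as a baseline
    (an assumption of this sketch).
    """
    m = max(log_probs)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in log_probs]
    z = sum(weights)
    probs = [w / z for w in weights]
    errors = [word_errors(hyp, ref) for hyp in hyps]
    mean_err = sum(errors) / len(errors)
    loss = sum(p * (e - mean_err) for p, e in zip(probs, errors))
    return loss, probs, errors


# Hypothetical 3-best list for one utterance.
ref = "play some music"
hyps = ["play some music", "play sum music", "lay some music now"]
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
loss, probs, errors = nbest_expected_word_errors(hyps, log_probs, ref)
print(errors, round(loss, 3))
```

Lowering this loss shifts probability mass toward hypotheses with fewer word errors than the list average, which is the sense in which the criterion targets WER rather than log-likelihood.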

Code Implementations: 2 repos