LG | Nov 19, 2015

Task Loss Estimation for Sequence Prediction

arXiv:1511.06456v4 | 33 citations
Originality: Highly original
AI Analysis

This paper addresses the challenge of training sequence prediction models when task-specific metrics such as edit distance or BLEU cannot be optimized directly, and offers a novel approach for encoder-decoder models.

The paper tackles the problem of optimizing non-differentiable task losses in supervised machine learning. It proposes a method for deriving differentiable surrogate losses that are provably consistent with the task loss. In speech recognition, this yields a ~13% relative improvement in Character Error Rate (CER) when no extra language modeling data is used.
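To make the reported metric concrete, here is a minimal sketch of how a Character Error Rate and a relative improvement are commonly computed. The function names and the numbers in the usage note are illustrative assumptions, not values or code from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def relative_improvement(baseline, new):
    """Fraction by which `new` reduces `baseline` (both are error rates)."""
    return (baseline - new) / baseline
```

For example, with hypothetical error rates, `relative_improvement(0.10, 0.087)` is about 0.13: a drop from 10% to 8.7% CER is a 13% relative improvement even though the absolute change is only 1.3 points.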

Often, the performance on a supervised machine learning task is evaluated with a "task loss" function that cannot be optimized directly. Examples of such loss functions include the classification error, the edit distance and the BLEU score. A common workaround for this problem is to instead optimize a "surrogate loss" function, such as cross-entropy or hinge loss. For this remedy to be effective, it is important to ensure that minimization of the surrogate loss results in minimization of the task loss, a condition that we call "consistency with the task loss". In this work, we propose another method for deriving differentiable surrogate losses that provably meet this requirement. We focus on the broad class of models that define a score for every input-output pair. Our idea is that this score can be interpreted as an estimate of the task loss, and that the estimation error may be used as a consistent surrogate loss. A distinct feature of such an approach is that it defines the desirable value of the score for every input-output pair. We use this property to design specialized surrogate losses for Encoder-Decoder models, which are often used for sequence prediction tasks. In our experiment, we benchmark on the task of speech recognition. Using a new surrogate loss instead of cross-entropy to train an Encoder-Decoder speech recognizer brings a significant ~13% relative improvement in terms of Character Error Rate (CER) in the case when no extra corpora are used for language modeling.
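The core idea, treating the model's score as an estimate of the task loss and training on the estimation error, can be illustrated with a toy sketch. This is an assumption-laden simplification: the squared error, the fixed candidate list and the function names are chosen for illustration, and the specialized Encoder-Decoder losses from the paper are not reproduced here.

```python
def estimation_error(scores, task_losses):
    """Mean squared gap between model scores and true task losses.

    Squared error is an illustrative choice; the abstract only says that
    the estimation error serves as the surrogate loss.
    """
    gaps = [(s - l) ** 2 for s, l in zip(scores, task_losses)]
    return sum(gaps) / len(gaps)


def predict(candidates, scores):
    """Return the candidate whose estimated task loss (score) is lowest."""
    best = min(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


# Hypothetical example: the reference is "cat", and the task loss of each
# candidate is its edit distance to the reference.
candidates = ["cat", "cut", "dog"]
task_losses = [0, 1, 3]
scores = [0.2, 0.9, 2.5]  # made-up model outputs

surrogate = estimation_error(scores, task_losses)
prediction = predict(candidates, scores)
```

In this toy setting, driving the estimation error to zero makes every score equal to the true task loss, so picking the lowest-scoring candidate also picks the candidate with the lowest task loss. That is the intuition behind consistency; the formal guarantee is in the paper.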

Code Implementations: 1 repo
