LG · Nov 19, 2015

Task Loss Estimation for Sequence Prediction

Dzmitry Bahdanau, Dmitriy Serdyuk, Philémon Brakel, Nan Rosemary Ke, Jan Chorowski, Aaron Courville, Yoshua Bengio

arXiv:1511.06456v4 · 16.333 citations · Has Code

Originality Highly original

AI Analysis

The paper addresses the challenge of training sequence prediction models when task-specific metrics such as edit distance or BLEU cannot be optimized directly, and it offers a new approach for encoder-decoder models.

The paper tackles the problem of optimizing non-differentiable task losses in supervised machine learning. It proposes a method for deriving differentiable surrogate losses that are provably consistent with the task loss, and reports a ~13% relative improvement in Character Error Rate for speech recognition when no extra language modeling data is used.

Often, the performance on a supervised machine learning task is evaluated with a task loss function that cannot be optimized directly. Examples of such loss functions include the classification error, the edit distance and the BLEU score. A common workaround for this problem is to instead optimize a surrogate loss function, such as cross-entropy or hinge loss. For this remedy to be effective, it is important to ensure that minimization of the surrogate loss results in minimization of the task loss, a condition that we call consistency with the task loss. In this work, we propose another method for deriving differentiable surrogate losses that provably meet this requirement. We focus on the broad class of models that define a score for every input-output pair. Our idea is that this score can be interpreted as an estimate of the task loss, and that the estimation error may be used as a consistent surrogate loss. A distinct feature of such an approach is that it defines the desirable value of the score for every input-output pair. We use this property to design specialized surrogate losses for Encoder-Decoder models, which are often used for sequence prediction tasks. In our experiment, we benchmark on the task of speech recognition. Using a new surrogate loss instead of cross-entropy to train an Encoder-Decoder speech recognizer brings a significant ~13% relative improvement in terms of Character Error Rate (CER) in the case when no extra corpora are used for language modeling.
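The central idea of the abstract, treating the model's score for an input-output pair as an estimate of the task loss and penalizing the estimation error, can be illustrated with a toy sketch. This is not the paper's actual surrogate loss: the squared-error form, the explicit candidate list, and the function names `edit_distance`, `estimation_error`, and `predict` are all assumptions made for illustration, and the scores are plain numbers rather than outputs of an Encoder-Decoder model.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (the task loss here)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def estimation_error(scores, candidates, reference):
    """Mean squared gap between each candidate's score and its task loss.

    Assumption: squared error is used only as a simple, differentiable
    example of an estimation error; the paper may use a different form.
    """
    losses = [edit_distance(c, reference) for c in candidates]
    return sum((s - l) ** 2 for s, l in zip(scores, losses)) / len(losses)


def predict(scores, candidates):
    """Return the candidate with the lowest score (lowest estimated loss)."""
    best = min(range(len(scores)), key=scores.__getitem__)
    return candidates[best]


if __name__ == "__main__":
    reference = "cat"
    candidates = ["cat", "cart", "dog"]
    scores = [0.5, 1.0, 2.0]  # hypothetical model scores
    print(estimation_error(scores, candidates, reference))
    print(predict(scores, candidates))
```

In this sketch, the target value of each score is the edit distance of that candidate from the reference, which mirrors the abstract's remark that the approach defines a desirable score for every input-output pair. When the estimation error reaches zero, picking the lowest-scoring candidate also picks the candidate with the lowest task loss.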

