LG | AI | CL | ML | Oct 2, 2018

Optimal Completion Distillation for Sequence Learning

arXiv:1810.01398v2 | 45 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficient, hyperparameter-free training for sequence learning, particularly in speech recognition, and offers incremental improvements over existing methods.

The paper tackles the problem of training sequence-to-sequence models by introducing Optimal Completion Distillation (OCD), which uses a dynamic programming algorithm over edit distance to identify, for each prefix generated by the model, the suffixes that minimize total edit distance. It reports state-of-the-art performance, with 9.3% WER on the Wall Street Journal dataset and 4.5% WER on the Librispeech dataset.

We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pretraining or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution that puts equal probability on the first token of all the optimal suffixes. OCD achieves the state-of-the-art performance on end-to-end speech recognition, on both Wall Street Journal and Librispeech datasets, achieving $9.3\%$ WER and $4.5\%$ WER respectively.
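The target construction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `ocd_targets`, the `EOS` marker, and the token-list interface are assumptions made for illustration. It assumes the standard Levenshtein recurrence and treats a reference token as an optimal next token whenever the reference prefix preceding it attains the minimum edit distance to the generated prefix.

```python
from typing import Dict, List, Sequence

EOS = "</s>"  # hypothetical end-of-sequence marker, not taken from the paper


def ocd_targets(generated: Sequence[str],
                reference: Sequence[str]) -> List[Dict[str, float]]:
    """Return one target distribution per generated prefix (lengths 0..n).

    Entry i is uniform over the first tokens of all suffixes that minimize
    the total edit distance, given the first i generated tokens.
    """
    m = len(reference)
    # row[j] = edit distance between the current generated prefix
    # and reference[:j]; for the empty prefix this is simply j.
    row = list(range(m + 1))
    targets: List[Dict[str, float]] = []

    for i in range(len(generated) + 1):
        if i > 0:
            prev = row
            row = [i] + [0] * m
            for j in range(1, m + 1):
                substitute = prev[j - 1] + (generated[i - 1] != reference[j - 1])
                row[j] = min(substitute, prev[j] + 1, row[j - 1] + 1)

        best = min(row)
        optimal: List[str] = []
        for j in range(m + 1):
            if row[j] == best:
                # Continuing with reference[j:] is optimal; its first token
                # is reference[j], or EOS if the reference is exhausted.
                token = reference[j] if j < m else EOS
                if token not in optimal:
                    optimal.append(token)

        targets.append({tok: 1.0 / len(optimal) for tok in optimal})

    return targets


if __name__ == "__main__":
    for dist in ocd_targets(list("cut"), list("cat")):
        print(dist)
```

With reference `cat` and generated prefix `cu`, both `a` and `t` are optimal next tokens: the `u` can be read either as an insertion before `a` or as a substitution for `a`. The sketch therefore assigns each of them probability 0.5.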

Code Implementations: 2 repos
