TeaForN: Teacher-Forcing with N-grams
TeaForN addresses a fundamental issue in sequence generation for tasks such as machine translation and summarization, although it is an incremental improvement over existing teacher-forcing methods.
The paper tackles exposure bias and the lack of differentiability across timesteps in sequence generation models trained with teacher-forcing. It proposes TeaForN, which uses a stack of N decoders so that parameter updates are based on N prediction steps, and reports improved generation quality on the WMT 2014 English-French, CNN/Dailymail, and Gigaword benchmarks.
Sequence generation models trained with teacher-forcing suffer from issues related to exposure bias and lack of differentiability across timesteps. Our proposed method, Teacher-Forcing with N-grams (TeaForN), addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model parameter updates based on N prediction steps. TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup. Empirically, we show that TeaForN boosts generation quality on one Machine Translation benchmark, WMT 2014 English-French, and two News Summarization benchmarks, CNN/Dailymail and Gigaword.
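To make the stacked-decoder idea more concrete, here is a minimal Python sketch of how a loss over N prediction steps could be assembled. It is an illustration under stated assumptions, not the paper's implementation: the toy decoder (a causal running mean followed by a linear projection), the sharing of weights across all N decoders, the use of softmax-weighted expected embeddings as the input to the next decoder, and the `discount` factor are all choices made here for illustration. The names `toy_decoder` and `teaforn_loss` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def toy_decoder(inputs, w_out):
    """Hypothetical stand-in for a real decoder (assumption).

    Position t only sees inputs[0..t] (causal), via a running mean,
    and projects that state to logits with w_out (dim x vocab).
    """
    dim, vocab = len(inputs[0]), len(w_out[0])
    running = [0.0] * dim
    logits = []
    for t, vec in enumerate(inputs):
        running = [r + v for r, v in zip(running, vec)]
        state = [r / (t + 1) for r in running]
        logits.append(
            [sum(state[d] * w_out[d][v] for d in range(dim)) for v in range(vocab)]
        )
    return logits


def teaforn_loss(tokens, embed, w_out, n_decoders, discount=1.0):
    """Sum of per-decoder negative log-likelihoods over a stack of decoders.

    Decoder 1 reads gold tokens (standard teacher-forcing). Decoder k > 1
    reads the expected embedding of decoder k-1's predictions (assumption),
    so position t of decoder k is scored against gold token tokens[t + k].
    Requires n_decoders <= len(tokens) - 1.
    """
    dim = len(embed[0])
    inputs = [embed[tok] for tok in tokens[:-1]]  # gold inputs for decoder 1
    total = 0.0
    for k in range(1, n_decoders + 1):
        probs = [softmax(row) for row in toy_decoder(inputs, w_out)]
        # Keep only positions whose k-step-ahead target exists.
        pairs = [(p, tokens[t + k]) for t, p in enumerate(probs) if t + k < len(tokens)]
        nll = -sum(math.log(p[y]) for p, y in pairs) / len(pairs)
        total += discount ** (k - 1) * nll
        # Soft (differentiable in a real framework) input for the next decoder.
        inputs = [
            [sum(p[v] * embed[v][d] for v in range(len(embed))) for d in range(dim)]
            for p in probs
        ]
    return total


random.seed(0)
vocab_size, dim = 5, 4
embed = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
w_out = [[random.gauss(0, 1) for _ in range(vocab_size)] for _ in range(dim)]
tokens = [0, 3, 1, 4, 2, 1]  # token 0 plays the role of a start symbol

loss_tf = teaforn_loss(tokens, embed, w_out, n_decoders=1)
loss_n3 = teaforn_loss(tokens, embed, w_out, n_decoders=3, discount=0.5)
print(loss_tf, loss_n3)
```

With `n_decoders=1` the function reduces to an ordinary teacher-forcing loss, which mirrors the claim that only minimal modifications to a standard setup are needed. In an automatic-differentiation framework, the soft inputs passed between decoders would let gradients from later prediction steps reach earlier ones.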