Investigating the Decoders of Maximum Likelihood Sequence Models: A Look-ahead Approach
This work studies decoding in maximum likelihood sequence models for tasks such as OCR and machine translation, proposing a k-step look-ahead module and an auxiliary EOS loss that improve decoding without training a separate value network.
The paper aims to improve decoding in maximum likelihood sequence models by proposing a k-step look-ahead module that incorporates multi-step future information. The module improves performance on simpler datasets such as IM2LATEX-100k and WMT16 multimodal machine translation, but yields only marginal gains on the more difficult WMT14 machine translation task. The paper further addresses the overestimated EOS probability by integrating an auxiliary EOS loss into training, which improves both the look-ahead module and the robustness of beam search.
We demonstrate how to practically incorporate multi-step future information into the decoder of a maximum likelihood sequence model. We propose a "k-step look-ahead" module that considers the likelihood information of a rollout of up to k steps. Unlike other approaches that need to train another value network to evaluate the rollouts, our look-ahead module can be applied directly to improve the decoding of any sequence model trained in a maximum likelihood framework. We evaluate the look-ahead module on three datasets of varying difficulty: IM2LATEX-100k (OCR, image to LaTeX), WMT16 multimodal machine translation, and WMT14 machine translation. The module improves performance on the simpler datasets, IM2LATEX-100k and WMT16 multimodal machine translation. However, the improvement on the more difficult dataset, WMT14 machine translation (which, e.g., contains longer sequences), is marginal. Further investigation using the k-step look-ahead suggests that the more difficult tasks suffer from an overestimated EOS (end-of-sentence) probability. We argue that the overestimated EOS probability also causes the performance of beam search to decrease as its beam width increases. We tackle the EOS problem by integrating an auxiliary EOS loss into training to estimate whether the model should emit EOS or another word. Our experiments show that improving EOS estimation increases not only the performance of our proposed look-ahead module but also the robustness of beam search.
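To make the idea of a k-step look-ahead concrete, the following is a minimal Python sketch under stated assumptions; it is not the authors' implementation. It assumes a hypothetical function `log_probs(prefix)` that returns next-token log-probabilities from an already trained model, and it scores each candidate token by its own log-probability plus the best cumulative log-likelihood of any rollout of up to k further steps. The exhaustive recursion, the toy table, and all names here are illustrative only; a practical decoder would restrict or prune the rollouts.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"  # assumed end-of-sentence symbol for this sketch

# Hypothetical model interface: prefix -> {next token: log-probability}.
LogProbFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def best_rollout_score(log_probs: LogProbFn, prefix: Tuple[str, ...], depth: int) -> float:
    """Best cumulative log-likelihood of any continuation of `prefix`
    that is at most `depth` tokens long (rollouts stop at EOS)."""
    if depth == 0 or (prefix and prefix[-1] == EOS):
        return 0.0
    return max(
        lp + best_rollout_score(log_probs, prefix + (tok,), depth - 1)
        for tok, lp in log_probs(prefix).items()
    )


def lookahead_decode(log_probs: LogProbFn, k: int, max_len: int = 20) -> List[str]:
    """Decode one token at a time, choosing the token whose best
    k-step rollout has the highest log-likelihood. k = 0 is greedy decoding."""
    prefix: Tuple[str, ...] = ()
    while len(prefix) < max_len:
        scores = {
            tok: lp + best_rollout_score(log_probs, prefix + (tok,), k)
            for tok, lp in log_probs(prefix).items()
        }
        tok = max(scores, key=scores.get)
        prefix += (tok,)
        if tok == EOS:
            break
    return list(prefix)


# Toy distribution (made up for illustration): "a" looks best locally,
# but every continuation of "a" is less likely than the one after "b".
TOY_TABLE = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
    ("b",): {"z": 0.0},
}


def toy_log_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    # Any prefix not in the table deterministically emits EOS.
    return TOY_TABLE.get(tuple(prefix), {EOS: 0.0})


greedy = lookahead_decode(toy_log_probs, k=0)
one_step = lookahead_decode(toy_log_probs, k=1)
```

On this toy table, greedy decoding (k = 0) commits to "a" and ends with a sequence of probability 0.3, whereas a 1-step look-ahead sees that "b" leads to a sequence of probability 0.4 and picks it instead. No value network is involved: the scores come only from the model's own likelihoods, which is the property the abstract emphasizes.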