Improving Next Tokens via Second-to-Last Predictions with Generate and Refine
For researchers and practitioners working with autoregressive language models, this work targets next-token prediction accuracy and reports small but consistent improvements.
The paper aims to improve next-token predictions in autoregressive language models. It trains a decoder-only architecture to predict the second-to-last token of a sequence, a task on which accuracy is more than 15% higher than for standard next-token prediction. A generate-then-refine approach then combines both predictions, yielding smaller but consistent and significant gains in next-token prediction accuracy.
Autoregressive language models like GPT aim to predict next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture to predict the second-to-last token of a sequence of tokens. Our approach trains more efficiently than BERT-style models because it masks tokens in a structured, deterministic way. We use our model to improve the next-token predictions of a standard GPT by combining both predictions in a ``generate-then-refine'' approach. We demonstrate on several variants of GPT-2 and several datasets that (not unexpectedly) second-to-last token predictions are much more accurate, i.e., more than 15\% higher accuracy than standard next-token predictions. The ``generate-then-refine'' approach also improves next-token predictions, with smaller yet consistent and significant gains.
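To make the two ideas concrete, the following is a minimal sketch and not the implementation used in this work. It first builds second-to-last training examples with deterministic masking, then re-ranks next-token candidates in a generate-then-refine step. Several details are assumptions made for illustration only: the `<mask>` placeholder, the greedy one-token lookahead, the combination of both predictions by multiplying probabilities, and the function names `second_to_last_examples` and `generate_then_refine`. Models are represented abstractly as functions that map a token list to a probability dictionary.

```python
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "<mask>"  # hypothetical placeholder token (assumption)
Dist = Dict[str, float]
Model = Callable[[List[str]], Dist]


def second_to_last_examples(
    tokens: Sequence[str], min_len: int = 3
) -> List[Tuple[List[str], str]]:
    """Build (input, target) pairs by hiding the second-to-last token.

    Every prefix of length >= min_len yields exactly one example, so the
    masked position is fixed by the prefix length rather than sampled.
    """
    examples = []
    for end in range(min_len, len(tokens) + 1):
        prefix = list(tokens[:end])
        target = prefix[-2]
        prefix[-2] = MASK
        examples.append((prefix, target))
    return examples


def generate_then_refine(
    context: Sequence[str],
    next_token_model: Model,
    second_to_last_model: Model,
    k: int = 3,
) -> str:
    """Re-rank the top-k next-token candidates (illustrative scheme)."""
    # Generate: top-k candidates from the standard next-token model.
    p_next = next_token_model(list(context))
    candidates = sorted(p_next, key=p_next.get, reverse=True)[:k]

    scores = {}
    for cand in candidates:
        # Greedy lookahead: the most likely token following the candidate.
        p_after = next_token_model(list(context) + [cand])
        lookahead = max(p_after, key=p_after.get)
        # Refine: score the candidate at the masked, second-to-last position.
        p_refined = second_to_last_model(list(context) + [MASK, lookahead])
        # Assumption: combine both predictions by multiplying probabilities.
        scores[cand] = p_next[cand] * p_refined.get(cand, 0.0)
    return max(scores, key=scores.get)
```

In practice, both callables would wrap trained networks, for example a GPT-2 variant for `next_token_model`. The product rule is only one of several plausible ways to combine the two distributions.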