Post-Completion Learning for Language Models
This paper addresses inefficient training of language models, offering a novel approach that improves output quality without compromising deployment efficiency. It appears incremental, however, because it builds on existing SFT and RL techniques.
The paper observes that language model training stops learning at the end-of-sequence token and proposes Post-Completion Learning (PCL), a framework that uses the post-completion space to enhance reasoning and self-evaluation, yielding consistent improvements over traditional SFT and RL methods across different datasets and models.
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: the model evaluates its output content according to the reward rules, and the resulting score is then calculated and aligned with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.
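
The mechanics described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker `<completion>`, the `<reward>` tag format, the exact-match reward rule, and every function name below are hypothetical choices made for illustration. The sketch shows three ideas from the text: a training sequence that continues past the completion point with a self-assessment and a reward prediction, inference that stops at the completion point, and a consistency signal that compares the model's predicted score with the score from a rule-based reward function.

```python
import re
from typing import Optional

# Hypothetical marker names; the source text only names <eos>.
COMPLETION_MARK = "<completion>"  # assumed marker for the completion point
EOS = "<eos>"


def accuracy_reward(answer: str, reference: str) -> float:
    """Assumed rule-based reward: 1.0 on exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def build_training_sequence(output: str, self_eval: str, predicted_reward: float) -> str:
    """Append the post-completion part (self-assessment and reward prediction)."""
    return (
        f"{output}{COMPLETION_MARK}"
        f"{self_eval} <reward>{predicted_reward:.1f}</reward>{EOS}"
    )


def truncate_for_inference(generated: str) -> str:
    """At inference, keep only the text before the completion point."""
    return generated.split(COMPLETION_MARK, 1)[0]


def parse_predicted_reward(sequence: str) -> Optional[float]:
    """Read the model's own reward prediction from the post-completion part."""
    match = re.search(r"<reward>([-+]?\d*\.?\d+)</reward>", sequence)
    return float(match.group(1)) if match else None


def consistency_reward(predicted: Optional[float], actual: float, tol: float = 1e-6) -> float:
    """1.0 when the predicted score agrees with the rule-based score, else 0.0."""
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - actual) <= tol else 0.0


if __name__ == "__main__":
    seq = build_training_sequence("2+2=4. Answer: 4", "The answer matches the rule.", 1.0)
    actual = accuracy_reward("4", "4")
    print(truncate_for_inference(seq))
    print(consistency_reward(parse_predicted_reward(seq), actual))
```

In this sketch the consistency signal is binary; a real system could just as well use a graded distance between the predicted and computed scores, and the abstract does not specify which form is used.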