CL, AI · Jul 27, 2025

Post-Completion Learning for Language Models

Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Chao Feng, Can Huang

arXiv:2507.20252v3 · 2 citations · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses inefficient use of the training signal in language models. It offers a novel approach that improves output quality without compromising deployment efficiency, although it appears incremental because it builds on existing SFT and RL techniques.

The paper tackles the fact that language model training stops at the end-of-sequence token. It proposes Post-Completion Learning (PCL), a framework that uses the post-completion space to improve reasoning and self-evaluation, and reports consistent improvements over traditional SFT and RL methods across several datasets and models.

Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after the model's output is complete to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: the model evaluates its output according to the reward rules and computes a score, which is then aligned with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.
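The training/inference asymmetry and the white-box reward described above can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration, not the authors' implementation: the string markers `<eos>`, `<answer>`, `<eval>`, and `<score>`, the two reward rules (format and exact-match accuracy), the helper names, and the `alignment_reward` formula and `weight` used to mix it with the task reward are all hypothetical. The sketch shows three ideas: the training target continues past `<eos>` with a self-evaluation computed from the reward rules (an evaluation-track SFT target), inference truncates at `<eos>`, and during RL the model's self-predicted score is compared with the rule-based reward to yield an extra consistency signal.

```python
import re

EOS = "<eos>"  # hypothetical completion marker; real tokenizers define their own


def rule_reward(answer: str, reference: str) -> dict:
    """Hypothetical reward rules: one point for format, one for exact-match accuracy."""
    well_formed = answer.startswith("<answer>") and answer.endswith("</answer>")
    body = answer[len("<answer>"):-len("</answer>")] if well_formed else answer
    return {
        "format": 1.0 if well_formed else 0.0,
        "accuracy": 1.0 if body.strip() == reference.strip() else 0.0,
    }


def evaluation_target(reward: dict) -> str:
    """Post-completion text for the evaluation track: per-rule scores, then the total."""
    parts = " ".join(f"{name}={value:.1f}" for name, value in reward.items())
    return f"<eval>{parts}</eval><score>{sum(reward.values()):.1f}</score>"


def build_training_sequence(prompt: str, answer: str, reference: str) -> str:
    """Training sequence: the answer, then <eos>, then the self-evaluation after it."""
    return prompt + answer + EOS + evaluation_target(rule_reward(answer, reference))


def truncate_at_eos(generated: str) -> str:
    """Inference stops at <eos>, so the post-completion text is never decoded."""
    return generated.split(EOS, 1)[0]


def predicted_score(post_completion: str):
    """Parse the model's own reward prediction; None if it produced none."""
    match = re.search(r"<score>([0-9]+(?:\.[0-9]+)?)</score>", post_completion)
    return float(match.group(1)) if match else None


def alignment_reward(predicted, actual: float, max_score: float = 2.0) -> float:
    """Hypothetical consistency term: 1.0 for an exact prediction, decaying linearly to 0."""
    if predicted is None:
        return 0.0
    return max(0.0, 1.0 - abs(predicted - actual) / max_score)


def hybrid_reward(generated: str, reference: str, weight: float = 0.5) -> float:
    """Task reward on the pre-<eos> answer plus a weighted self-evaluation consistency term."""
    answer, _, post = generated.partition(EOS)
    actual = sum(rule_reward(answer, reference).values())
    return actual + weight * alignment_reward(predicted_score(post), actual)
```

For example, `hybrid_reward("<answer>4</answer><eos><score>2.0</score>", "5")` earns 1.0 from the task rules (correct format, wrong answer) plus 0.5 × 0.5 for a self-prediction that overestimates by one point, for a total of 1.25. Because decoding stops at `<eos>`, the evaluation text adds no cost at deployment.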

