CL · Jun 16, 2025

BOW: Reinforcement Learning for Bottlenecked Next Word Prediction

Ming Shen, Zhikun Xu, Jacob Dineen, Xiao Ye, Ben Zhou

arXiv:2506.13502v2 · 8.33 citations · h-index: 5

Originality: Incremental advance

AI Analysis

The paper addresses the need for stronger reasoning in language models, though the advance is incremental because it builds on existing next-word prediction methods.

The paper tackles limited explicit reasoning in large language models by introducing BOW, a reinforcement learning formulation of next-word prediction that inserts a reasoning bottleneck. After a brief adaptation phase, BOW improves zero-shot reasoning and outperforms continual-pretraining baselines by nearly 5% on average across ten benchmarks.

Large language models (LLMs) are typically pretrained with next-word prediction (NWP), which yields strong surface fluency but places limited pressure on models to form explicit reasoning before emitting tokens. We study whether shifting the supervision signal can better elicit explicit reasoning and, more broadly, strengthen models' general reasoning capability. We present BOttlenecked next-Word prediction (BOW), an RL formulation of NWP that inserts an intermediate reasoning bottleneck. Instead of predicting the next word directly from context, the policy model must first generate a next-word reasoning trajectory. A frozen scorer then assigns this trajectory a soft, distributional reward equal to the probability of the gold next token conditioned solely on the trajectory to guide the RL optimization. We also propose an optional L1-style regularizer on the reward to discourage "name-the-answer" shortcuts. Across ten benchmarks, a brief BOW adaptation phase on Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct improves zero-shot reasoning and outperforms strong continual-pretraining baselines, including an RL variant with a hard, binary reward and a supervised finetuning approach with augmented data, by nearly 5% on average, while achieving the top result in 7 of 10 intrinsic NWP evaluations. These results indicate that BOW is a viable alternative to vanilla NWP, inducing explicit next-word reasoning and strengthening general reasoning ability.
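
To make the reward described in the abstract concrete, the following is a minimal sketch that contrasts the soft, distributional reward with a hard, binary one. It is not the authors' implementation. The frozen scorer is replaced by a hand-written dictionary of logits over candidate next words, the names `softmax`, `soft_reward`, `hard_reward`, and `example_logits` are hypothetical, and the argmax rule used for the binary reward is an assumption, since the abstract does not define it. The optional L1-style regularizer is left out because its form is not given in the abstract.

```python
import math


def softmax(logits):
    """Turn a {token: logit} dict into a probability distribution."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def soft_reward(scorer_logits, gold_token):
    """Soft reward: probability the scorer assigns to the gold next token.

    `scorer_logits` stands in for the frozen scorer's output when it is
    conditioned only on the reasoning trajectory (assumed toy input).
    """
    return softmax(scorer_logits).get(gold_token, 0.0)


def hard_reward(scorer_logits, gold_token):
    """Binary reward (assumed rule): 1 if the top-scoring token is the gold token."""
    probs = softmax(scorer_logits)
    top_token = max(probs, key=probs.get)
    return 1.0 if top_token == gold_token else 0.0


# Toy scorer output for one reasoning trajectory (made-up numbers).
example_logits = {"umbrella": 2.0, "coat": 1.0, "hat": 0.0}

if __name__ == "__main__":
    for gold in ("umbrella", "coat"):
        print(gold, round(soft_reward(example_logits, gold), 4),
              hard_reward(example_logits, gold))
```

In this toy case a trajectory that leaves "coat" as the second most likely word still earns partial credit under the soft reward, while the binary reward gives it nothing. That graded signal is the property the abstract attributes to the distributional reward.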
