LG | CL | Mar 1

Stabilizing Policy Optimization via Logits Convexity

arXiv:2603.00963v1 | 1 citation | h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses training instability for researchers and practitioners who apply reinforcement learning to large language models, but the advance is incremental because it builds on existing optimization frameworks.

The authors tackle the instability of reinforcement learning optimization in large language models. They show that the convexity of the supervised fine-tuning loss with respect to the logits stabilizes training, and they propose Logits Convex Optimization (LCO), which improves training stability and outperforms conventional RL methods on benchmarks.

While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
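
The convexity claim can be illustrated with a minimal sketch, which is not the paper's LCO method or its analysis. For a single token, the SFT cross-entropy loss over logits z has Hessian diag(p) - p p^T with p = softmax(z), which is positive semidefinite, so the loss is convex in z. Inside its clipping range, PPO's surrogate reduces to an advantage-weighted probability ratio, which is linear in the probability rather than in its logarithm, and its Hessian in z can have eigenvalues of both signs. The function names, the three-token uniform policy, and the advantage value below are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sft_hessian(z):
    """Hessian of the cross-entropy loss -log softmax(z)[y] w.r.t. logits z.

    The result, diag(p) - p p^T, does not depend on the target index y.
    """
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def ratio_loss_hessian(z, action, advantage, p_old):
    """Hessian w.r.t. z of the unclipped loss -advantage * softmax(z)[action] / p_old.

    Assumption: p_old is a constant (the old policy's probability of `action`),
    and clipping is inactive, so this is only the ratio term of the surrogate.
    """
    p = softmax(z)
    v = -p.copy()
    v[action] += 1.0  # v = e_action - p, the gradient direction of log p[action]
    hess_prob = p[action] * (np.outer(v, v) - (np.diag(p) - np.outer(p, p)))
    return -(advantage / p_old) * hess_prob

# Hypothetical setting: uniform policy over 3 tokens, positive advantage.
z = np.zeros(3)
sft_eigs = np.linalg.eigvalsh(sft_hessian(z))
rl_eigs = np.linalg.eigvalsh(
    ratio_loss_hessian(z, action=0, advantage=1.0, p_old=1.0 / 3.0)
)

print("SFT Hessian eigenvalues:       ", sft_eigs)
print("ratio-loss Hessian eigenvalues:", rl_eigs)
```

In this toy setting the SFT eigenvalues are all nonnegative, while the ratio-loss Hessian has both a negative and a positive eigenvalue, so the ratio term is neither convex nor concave in the logits at that point.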
