LG · AI · Dec 19, 2025

Trust-Region Adaptive Policy Optimization

arXiv:2512.17636v1 · 5 citations · h-index: 19
Originality: Highly original
AI Analysis

This paper addresses a key inefficiency in post-training LLMs for complex reasoning: the mismatch between the SFT and RL stages. Its approach could benefit AI researchers and practitioners who train reasoning models.

The paper tackles the inconsistency between supervised fine-tuning (SFT) and reinforcement learning (RL) when both are used to improve large language models' reasoning. It proposes TRAPO, a hybrid framework that interleaves SFT and RL within each training instance, and the authors report that it outperforms standard pipelines and state-of-the-art methods on five mathematical reasoning benchmarks.

Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models' (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL's potential for improvements. We address this inefficiency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model's own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
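The abstract does not give the exact loss, so the following is only a rough sketch of the two ideas it describes, under stated assumptions. The trust region is assumed to be a threshold `alpha` on the probability the model assigns to the expert token; outside it, the usual negative log-likelihood is swapped for a linear surrogate whose slope is capped at 1/alpha, which is one hypothetical way to "attenuate optimization". The RL term is a plain REINFORCE-style surrogate standing in for whatever RL objective the paper actually uses. The names `trust_region_nll`, `hybrid_loss`, and `alpha` are illustrative and do not come from the paper.

```python
import math


def trust_region_nll(p: float, alpha: float = 0.1) -> float:
    """Per-token SFT loss for an expert token the model assigns probability p.

    Assumption (not from the paper): the trust region is p >= alpha.
    Inside it, the loss is the standard -log(p) (forward-KL / MLE term).
    Outside it, the loss is linear in p with slope -1/alpha, so it matches
    -log(p) in value and slope at p == alpha but stays bounded as p -> 0.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p >= alpha:
        return -math.log(p)
    return -math.log(alpha) + (alpha - p) / alpha


def hybrid_loss(expert_prefix_probs, completion_logprobs, advantage, alpha=0.1):
    """Combine an SFT term on the expert prefix with an RL term on the completion.

    expert_prefix_probs: model probabilities of each expert prefix token.
    completion_logprobs: log-probabilities of the model's own sampled tokens.
    advantage: scalar advantage of the sampled completion (hypothetical input).
    Returns the token-averaged loss value for one training instance.
    """
    sft_term = sum(trust_region_nll(p, alpha) for p in expert_prefix_probs)
    # REINFORCE-style surrogate: its gradient is -advantage * grad(log-prob).
    rl_term = -advantage * sum(completion_logprobs)
    n_tokens = len(expert_prefix_probs) + len(completion_logprobs)
    return (sft_term + rl_term) / max(n_tokens, 1)
```

In this sketch, an expert token that the model finds very unlikely contributes a bounded loss rather than an unbounded one, so a single far-off expert token cannot dominate the update. Whether this matches the paper's TrSFT formulation would need to be checked against the full text.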

