LG
Dec 18, 2025

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

arXiv:2512.17008v2
7 citations
h-index: 63
Originality: Incremental advance
AI Analysis

This work addresses multi-turn reinforcement learning for LLM agents. It is an incremental advance that adapts an existing method, PPO, to a turn-level formulation.

The paper tackles the limitations of applying Group Relative Policy Optimization (GRPO) to multi-turn tasks in LLM agents. It introduces turn-PPO, a variant of Proximal Policy Optimization (PPO) that uses a turn-level MDP formulation instead of the usual token-level one, and demonstrates its effectiveness on the WebShop and Sokoban datasets.

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
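To make the contrast between the two advantage estimators concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the estimator, so the sketch assumes standard Generalized Advantage Estimation (GAE) applied with one MDP step per agent turn, and the usual GRPO-style standardization of returns within a group of rollouts. The function names, the example rewards, and the critic values are all hypothetical.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation over a sequence of MDP steps.

    Assumption: under a turn-level formulation each index is one whole agent
    turn, so rewards[t] and values[t] are per-turn quantities. The episode is
    assumed to end after the last step, so the value past the end is 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def grpo_advantages(group_returns: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: standardize each rollout's return in its group.

    Assumption: population standard deviation; every step of a rollout then
    shares this single scalar, with no learned critic involved.
    """
    mean = sum(group_returns) / len(group_returns)
    var = sum((r - mean) ** 2 for r in group_returns) / len(group_returns)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_returns]


# Hypothetical 3-turn episode with a reward only at the final turn.
turn_rewards = [0.0, 0.0, 1.0]
turn_values = [0.5, 0.6, 0.8]  # made-up critic outputs, one per turn
turn_adv = gae_advantages(turn_rewards, turn_values, gamma=1.0, lam=1.0)

# Hypothetical group of 4 rollouts of the same task, scored by final return.
group_adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

In this sketch the critic-based estimate gives each turn its own advantage, while the group-relative estimate assigns one value to an entire rollout. That difference in credit assignment is one plausible reading of why a critic-based method might behave differently on long-horizon tasks, though the abstract itself only reports the empirical finding.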

