LG · AI · CL · Nov 18, 2025

Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization

arXiv:2511.14846v1 · 4 citations
Originality: Highly original
AI Analysis

This paper addresses training stagnation in multi-turn reasoning tasks. For AI researchers, it represents an incremental improvement over prior reinforcement learning approaches.

The paper tackles the challenge of training Large Language Models for multi-turn Tool-Integrated Reasoning by proposing Group Turn Policy Optimization (GTPO), which outperforms Group Relative Policy Optimization (GRPO) by 3.0% on average across reasoning benchmarks.

Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
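
To make the first two ideas in the abstract concrete (turn-level rewards and return-based advantages), here is a minimal sketch, not the paper's implementation. The function names, the discount factor `gamma`, and the choice to normalize returns by the mean and standard deviation pooled over all turns in a group are illustrative assumptions; the abstract says only that normalized discounted returns serve as advantages.

```python
from statistics import mean, pstdev


def discounted_returns(turn_rewards, gamma=0.9):
    """Discounted return for each turn: G_t = r_t + gamma * G_{t+1}.

    `turn_rewards` holds one scalar reward per turn of a single trajectory.
    `gamma` is an assumed hyperparameter, not a value from the paper.
    """
    returns = []
    running = 0.0
    # Walk backwards so each turn accumulates the rewards that follow it.
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def group_turn_advantages(group_turn_rewards, gamma=0.9, eps=1e-8):
    """Turn-level advantages for a group of trajectories sampled for one prompt.

    Assumption: returns are normalized with statistics pooled over every
    turn of every trajectory in the group; the paper's exact normalization
    may differ.
    """
    group_returns = [discounted_returns(r, gamma) for r in group_turn_rewards]
    flat = [g for trajectory in group_returns for g in trajectory]
    mu = mean(flat)
    sigma = pstdev(flat)
    return [
        [(g - mu) / (sigma + eps) for g in trajectory]
        for trajectory in group_returns
    ]


if __name__ == "__main__":
    # Two hypothetical rollouts: one solves the task on its third turn,
    # the other stops after two turns with no reward.
    rewards = [[0.0, 0.0, 1.0], [0.0, 0.0]]
    for trajectory in group_turn_advantages(rewards, gamma=0.5):
        print([round(a, 3) for a in trajectory])
```

In this sketch every turn gets its own advantage, so turns closer to a rewarded outcome receive a stronger signal than earlier ones, whereas a trajectory-level scheme assigns one value to the whole rollout.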
