AI · Oct 8, 2025

TGPR: Tree-Guided Policy Refinement for Robust Self-Debugging of LLMs

arXiv:2510.06878v1
Originality: Highly original
AI Analysis

The paper addresses robust self-debugging and iterative refinement in LLMs for tasks such as code generation. It offers a general framework that is incremental in concept but delivers strong gains on specific benchmarks.

The paper tackles the challenge of searching the refinement space effectively during iterative refinement. It introduces TGPR, which combines GRPO with a Thompson-Sampling-based tree search, and reports improvements of up to +4.2 percentage points in pass@1 on MBPP and up to +12.51 percentage points in pass@10 on APPS over a competitive GRPO baseline.
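For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below assumes that estimator; this summary does not state how the paper computes its numbers, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed (assumed estimator)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A per-benchmark score is the mean of this value over all problems, so a gain of +4.2 percentage points in pass@1 means the average rises by 0.042.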

Iterative refinement is a promising paradigm for enabling large language models (LLMs) to solve difficult reasoning and problem-solving tasks. One of the key challenges, however, is how to search effectively through the enormous space of possible refinements. Existing methods typically fall back on predefined heuristics, which struggle with the exploration-exploitation dilemma and cannot adapt based on past refinement outcomes. We introduce Tree-Guided Policy Refinement (TGPR), a novel framework that combines Group Relative Policy Optimization (GRPO) with a Thompson-Sampling-based tree search. TGPR actively explores both failed and successful refinement paths, yielding denser training trajectories and more adaptive policies. On the HumanEval, MBPP, and APPS benchmarks, our method achieves up to +4.2 percentage points absolute improvement in pass@1 (on MBPP) and up to +12.51 percentage points absolute improvement in pass@10 (on APPS) compared to a competitive GRPO baseline. Beyond code debugging, TGPR offers a principled approach to combining learned policies with structured search methods, providing a general framework for enhancing iterative refinement and stateful reasoning in LLMs.
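To make the search component more concrete, here is a minimal sketch of Thompson-sampling node selection over a refinement tree. It assumes each node keeps a Beta posterior over whether refining it produces a passing program. The `Node` class, the uniform Beta(1, 1) prior, and the binary pass/fail reward are illustrative assumptions rather than details taken from the paper, and the GRPO policy update is omitted entirely.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate program in the refinement tree (hypothetical structure)."""
    code: str
    successes: int = 0  # refinements of this node that passed the tests
    failures: int = 0   # refinements of this node that failed the tests
    children: list = field(default_factory=list)


def all_nodes(root):
    """Collect every node in the tree with an iterative depth-first walk."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(node.children)
    return found


def select_node(root, rng):
    """Thompson sampling: draw once from each node's posterior, take the max."""
    def draw(node):
        # Beta(1 + successes, 1 + failures) is the posterior under a uniform prior.
        return rng.betavariate(1 + node.successes, 1 + node.failures)
    return max(all_nodes(root), key=draw)


def expand(node, refined_code, passed):
    """Attach a refinement as a child and update the parent's counts."""
    child = Node(code=refined_code)
    node.children.append(child)
    if passed:
        node.successes += 1
    else:
        node.failures += 1
    return child
```

Because selection draws from the posterior instead of taking its mean, nodes with few observations still get picked occasionally, which is how this style of search trades exploration against exploitation without a hand-tuned heuristic.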

