CL, Jan 23

TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization

arXiv:2601.16480v1
h-index: 21
Originality: Highly original
AI Analysis

This paper addresses a specific bottleneck in iterative optimization for scientific tasks such as analog circuit sizing, and offers a practical improvement over standard GRPO and Bayesian optimization.

The paper tackles fine-grained optimization in iterative reasoning tasks, where existing RL methods cannot optimize at the turn level. It proposes TL-GRPO, which outperforms standard GRPO and Bayesian optimization on analog circuit sizing tasks; a 30B model trained with TL-GRPO achieves state-of-the-art performance under the same simulation budget.

Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, which is typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under the same simulation budget, demonstrating both strong generalization and practical utility.
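
The contrast the abstract draws between trajectory-level and turn-level credit can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the reward numbers, and the choice of population standard deviation are all assumptions made for illustration. It shows (a) scoring a trajectory by its best turn rather than by a sum, and (b) applying GRPO-style group-relative standardization either to whole trajectories or to a group of candidate turns sampled from the same context.

```python
from statistics import mean, pstdev

def trajectory_value(turn_rewards):
    """Score a trajectory by its best turn-level reward, not the cumulative return."""
    return max(turn_rewards)

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: standardize rewards within one group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Trajectory-level grouping: one advantage per whole trajectory.
# Hypothetical per-turn rewards for three sampled trajectories.
trajectories = [[0.2, 0.5, 0.4], [0.1, 0.3, 0.9], [0.6, 0.6, 0.7]]
traj_values = [trajectory_value(t) for t in trajectories]
traj_adv = group_advantages(traj_values)

# Turn-level grouping: several candidate turns sampled from the same context,
# each with its own reward, so each turn receives its own advantage.
turn_group_rewards = [0.4, 0.9, 0.7, 0.2]
turn_adv = group_advantages(turn_group_rewards)

print(traj_values)
print([round(a, 3) for a in turn_adv])
```

Under trajectory-level grouping, every turn in a trajectory shares a single advantage; under turn-level grouping, each sampled turn is compared only against its siblings from the same context, which is one plausible reading of "fine-grained" in the abstract.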

