CLMar 25

From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

arXiv:2603.2395188.71 citationsh-index: 17
AI Analysis

This addresses the problem of automating algorithm discovery for language model optimization, offering a novel framework that is incremental in building upon existing methods like GRPO.

The paper tackles the costly manual process of discovering improved policy optimization algorithms for language models by proposing POISE, a closed-loop framework for automated discovery, which in experiments discovered mechanisms that improved weighted Overall from 47.8 to 52.5 and increased AIME25 pass@32 from 26.7% to 43.3%.

Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes