LG · Jan 21

CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning

arXiv:2601.15141v1 · 4.93 citations · h-index: 4

Originality: Highly original

AI Analysis

CLEANER addresses a critical credit assignment issue in agentic RL for smaller models, offering a scalable way to improve efficiency and performance on complex problem-solving tasks.

The paper tackles noisy trajectories that hinder policy optimization in agentic reinforcement learning for parameter-constrained models. It proposes CLEANER, which self-purifies trajectories during data collection, and reports average accuracy gains of 6%, 3%, and 5% over baselines on AIME24/25, GPQA, and LiveCodeBench.

Agentic Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to utilize tools like Python interpreters for complex problem-solving. However, for parameter-constrained models (e.g., 4B-7B), the exploration phase is often plagued by frequent execution failures, creating noisy trajectories that hinder policy optimization. Under standard outcome-based reward settings, this noise leads to a critical credit assignment issue, where erroneous actions are inadvertently reinforced alongside successful outcomes. Existing mitigations face a dilemma: dense rewards often trigger reward hacking, while supersampling incurs prohibitive computational costs. To address these challenges, we propose CLEANER. Distinct from external filtering methods, CLEANER exploits the model's intrinsic self-correction capabilities to eliminate error-contaminated context directly during data collection. At its core, the Similarity-Aware Adaptive Rollback (SAAR) mechanism autonomously constructs clean, purified trajectories by retrospectively replacing failures with successful self-corrections. Based on semantic similarity, SAAR adaptively regulates replacement granularity from shallow execution repairs to deep reasoning substitutions. By training on these self-purified paths, the model internalizes correct reasoning patterns rather than error-recovery loops. Empirical results on AIME24/25, GPQA, and LiveCodeBench show average accuracy gains of 6%, 3%, and 5% over baselines. Notably, CLEANER matches state-of-the-art performance using only one-third of the training steps, highlighting trajectory purification as a scalable solution for efficient agentic RL. Our models and code are available at GitHub.
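The rollback idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Step` structure, the `purify` function, the 0.6 threshold, and the use of `difflib.SequenceMatcher` as a lexical stand-in for semantic similarity are all assumptions made for illustration. The sketch also assumes that a high similarity between the failed and corrected code calls for a shallow repair (keep the original reasoning, swap in the working code), while a low similarity calls for a deep substitution (replace the reasoning as well).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical trajectory step: a thought, a tool call, and its outcome."""
    reasoning: str  # natural-language reasoning before the tool call
    code: str       # code sent to the interpreter
    ok: bool        # True if execution succeeded


def similarity(a: str, b: str) -> float:
    """Lexical stand-in (assumption) for a semantic similarity score in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def purify(trajectory: List[Step], threshold: float = 0.6) -> List[Step]:
    """Replace each failure with the successful self-correction that follows it.

    Assumed rule: similar code -> shallow repair (original reasoning is kept);
    dissimilar code -> deep substitution (reasoning is replaced as well).
    Failures that are never corrected are dropped from the output.
    """
    clean: List[Step] = []
    pending: Optional[Step] = None  # earliest failure still awaiting a fix
    for step in trajectory:
        if not step.ok:
            if pending is None:
                pending = step
            continue
        if pending is None:
            clean.append(step)  # ordinary successful step, kept unchanged
        elif similarity(pending.code, step.code) >= threshold:
            # Shallow execution repair: only the code is swapped.
            clean.append(Step(pending.reasoning, step.code, True))
        else:
            # Deep reasoning substitution: the corrected step replaces it all.
            clean.append(step)
        pending = None
    return clean
```

In this sketch, a trajectory that fails on a missing parenthesis and then succeeds with the fixed call collapses into a single clean step carrying the original reasoning, so no error-recovery turn remains in the training data.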

