CL, AI, LG
Jun 17, 2024

Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement

arXiv:2406.11176v2
83 citations
Originality: Incremental advance
AI Analysis

This work addresses the issue of error-prone agent actions in interactive tasks by providing process supervision, offering an incremental improvement over existing methods that focus on outcome rewards.

The paper tackles suboptimal actions in large language model agents by introducing the Iterative step-level Process Refinement (IPR) framework, which gives the agent step-by-step guidance and uses Monte Carlo estimates of step-level rewards during training. The framework outperforms strong baselines on three complex agent tasks.

Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the Iterative step-level Process Refinement (IPR) framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical findings highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.
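The loop described in the abstract (estimate a step-level reward by Monte Carlo rollouts, compare the agent's action with the expert's at each step, and keep the discrepancies as contrastive pairs) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy number-line environment, the names `mc_step_reward` and `build_contrastive_pairs`, the `margin` threshold, and the rollout policies. How the resulting pairs are used for training is not shown.

```python
import random
from typing import Callable, List, Tuple

State = int
Action = int  # toy action space: +1 (towards the goal) or -1 (away from it)
Policy = Callable[[State, random.Random], Action]

GOAL = 3     # hypothetical task: end the episode at position >= GOAL
HORIZON = 3  # hypothetical episode length in steps


def step(state: State, action: Action) -> State:
    """Toy environment transition (assumed, not from the paper)."""
    return state + action


def outcome_reward(final_state: State) -> float:
    """Outcome reward, observed only at the end of an episode."""
    return 1.0 if final_state >= GOAL else 0.0


def mc_step_reward(state: State, action: Action, t: int,
                   rollout_policy: Policy, n_samples: int,
                   rng: random.Random) -> float:
    """Monte Carlo estimate of the value of taking `action` at step `t`:
    the mean outcome reward over `n_samples` rollouts to the horizon."""
    after = step(state, action)
    total = 0.0
    for _ in range(n_samples):
        cur = after
        for _ in range(HORIZON - t - 1):
            cur = step(cur, rollout_policy(cur, rng))
        total += outcome_reward(cur)
    return total / n_samples


def build_contrastive_pairs(expert_actions: List[Action], agent_policy: Policy,
                            rollout_policy: Policy, n_samples: int,
                            margin: float, rng: random.Random
                            ) -> List[Tuple[State, Action, Action]]:
    """Walk along the expert trajectory; at each step let the agent propose
    an action and keep (state, chosen, rejected) when the expert's estimated
    step reward beats the agent's by more than `margin`."""
    pairs = []
    state = 0
    for t, expert_action in enumerate(expert_actions):
        agent_action = agent_policy(state, rng)
        if agent_action != expert_action:
            r_expert = mc_step_reward(state, expert_action, t,
                                      rollout_policy, n_samples, rng)
            r_agent = mc_step_reward(state, agent_action, t,
                                     rollout_policy, n_samples, rng)
            if r_expert - r_agent > margin:
                pairs.append((state, expert_action, agent_action))
        state = step(state, expert_action)  # continue along the expert path
    return pairs


def always_forward(state: State, rng: random.Random) -> Action:
    return 1


def always_backward(state: State, rng: random.Random) -> Action:
    return -1


def random_policy(state: State, rng: random.Random) -> Action:
    return rng.choice([1, -1])
```

A call such as `build_contrastive_pairs([1, 1, 1], random_policy, random_policy, n_samples=32, margin=0.2, rng=random.Random(0))` returns the steps where the sampled action looked clearly worse than the expert's. In a real agent setting the states would be interaction histories and the policies would be language models, so the rollouts would dominate the cost.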

Code Implementations: 1 repo
