AI · CL · Mar 22

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

arXiv:2603.2135785.21 citations · h-index: 37
Predicted impact: top 28% in AI · last 90 days
Originality: Incremental advance
AI Analysis

AgentHER addresses the inefficiency of discarding failed trajectories in LLM agent training. It offers a practical data augmentation method for improving agent performance on tasks such as web navigation and tool use.

LLM agents fail on most real-world tasks, and their failed trajectories are usually thrown away. The paper introduces AgentHER, a framework that converts those failed trajectories into training data through hindsight relabeling. It reports gains of +7.1-11.7 percentage points over success-only training across four model families, plus 2x data efficiency: baseline performance is matched with only half of the successful demonstrations.

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline -- failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging -- that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.
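The four stages named in the abstract (failure classification, outcome extraction, relabeling with confidence gating, and data packaging) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every class, function, rule, and field name below is hypothetical, the 0.8 threshold is an arbitrary placeholder, and the toy rules stand in for whatever rule-based or LLM-judge components the paper actually uses. Only an SFT-style record is produced; DPO and ShareGPT packaging are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    action: str       # what the agent did, e.g. a tool call
    observation: str  # what the environment returned


@dataclass
class Trajectory:
    goal: str          # the goal the agent was originally given
    steps: list[Step]
    success: bool


@dataclass
class Relabel:
    goal: str          # hindsight goal that the trajectory does achieve
    confidence: float  # relabeler's confidence in [0, 1]


def classify_failure(traj: Trajectory) -> str:
    """Stage 1 (toy rule, assumed): is this failure worth relabeling?"""
    if not traj.steps:
        return "empty"
    if all(s.observation.startswith("ERROR") for s in traj.steps):
        return "unrecoverable"
    return "relabelable"


def extract_outcome(traj: Trajectory) -> str:
    """Stage 2 (toy rule, assumed): the last non-error observation."""
    good = [s.observation for s in traj.steps
            if not s.observation.startswith("ERROR")]
    return good[-1]


def rule_based_relabel(traj: Trajectory, outcome: str) -> Relabel:
    """Stage 3 stand-in: a real system might call an LLM here."""
    return Relabel(goal=f"Report this result: {outcome}", confidence=0.9)


def hindsight_relabel(
    traj: Trajectory,
    relabeler: Callable[[Trajectory, str], Relabel] = rule_based_relabel,
    threshold: float = 0.8,  # hypothetical gate value
) -> Optional[dict]:
    """Turn one failed trajectory into an SFT-style record, or None."""
    if traj.success or classify_failure(traj) != "relabelable":
        return None
    outcome = extract_outcome(traj)
    relabel = relabeler(traj, outcome)
    if relabel.confidence < threshold:  # confidence gating
        return None
    # Stage 4: package the relabeled goal with the unchanged steps.
    return {
        "prompt": relabel.goal,
        "completion": "\n".join(
            f"{s.action} -> {s.observation}" for s in traj.steps),
        "original_goal": traj.goal,
    }


# Usage: a trajectory that failed goal A but demonstrates some goal B.
failed = Trajectory(
    goal="Find the cheapest laptop under $500",
    steps=[Step("search('laptop')", "Found: Laptop X at $650")],
    success=False,
)
record = hindsight_relabel(failed)
```

The agent's steps are kept unchanged and only the goal is rewritten, which is the core of hindsight relabeling. The gate drops any relabel whose confidence falls below the threshold, so uncertain relabels never reach the training set.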

