AI · May 19

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

arXiv:2605.19447
AI Analysis

For researchers building interactive LLM agents, SERL provides a principled way to leverage per-step feedback for credit assignment, addressing a key bottleneck in long-horizon tasks.

SERL improves multi-turn LLM agents by selectively using environment feedback to guide reinforcement learning, achieving 90.0% success on ALFWorld and 80.1% on WebShop, outperforming existing RL and distillation methods.

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals and do not fully leverage per-step environment feedback. Multi-turn agent settings, where feedback can include error messages, page changes, observations, or reference trajectories, remain underexplored. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine the update direction, while environment feedback adjusts where the update is placed and how large it is, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, respectively, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback inserted at meaningful points consistently outperforms indiscriminate use of longer or richer context.
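
The split between "direction from the task reward" and "placement and magnitude from environment feedback" can be illustrated with a toy sketch. This is not SERL's actual objective, which the abstract does not specify. It only assumes that each step already has a non-negative feedback score (a hypothetical quantity, for example how informative the environment's response to that action was) and shows one way such scores could reweight a trajectory-level advantage without changing its sign. The function name, the `floor` parameter, and the normalization are all illustrative assumptions.

```python
from typing import List


def reweighted_step_advantages(
    task_reward: float,
    baseline: float,
    feedback_scores: List[float],
    floor: float = 0.1,
) -> List[float]:
    """Spread one trajectory-level advantage over steps (illustrative only).

    The sign of every per-step advantage comes from the task reward alone;
    the hypothetical feedback scores only decide how much each step gets.
    """
    if not feedback_scores:
        return []

    # Direction: determined solely by the sparse task outcome.
    trajectory_advantage = task_reward - baseline

    # Magnitude: clip scores to a positive floor so that no step is
    # dropped entirely and no weight can flip the sign.
    clipped = [max(floor, s) for s in feedback_scores]

    # Normalize so the weights average to 1, which keeps the total
    # update size comparable to uniform credit assignment.
    total = sum(clipped)
    n = len(clipped)
    weights = [n * c / total for c in clipped]

    return [trajectory_advantage * w for w in weights]


# A successful three-step episode where only the second step received
# strong (assumed) feedback: that step gets most of the credit.
success = reweighted_step_advantages(1.0, 0.5, [0.0, 1.0, 0.0])

# The same episode ending in failure: every step is pushed down,
# and the second step is pushed down the most.
failure = reweighted_step_advantages(0.0, 0.5, [0.0, 1.0, 0.0])
```

In this sketch uniform feedback scores recover ordinary trajectory-level credit assignment, so the reweighting only matters when the environment singles out particular actions.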

