What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
For researchers building interactive LLM agents, SERL provides a principled way to leverage per-step feedback for credit assignment, addressing a key bottleneck in long-horizon tasks.
SERL improves multi-turn LLM agents by selectively using environment feedback to guide reinforcement learning. It achieves 90.0% success on ALFWorld and 80.1% on WebShop, outperforming existing RL and distillation methods.
Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals and do not fully exploit per-step environment feedback. Such feedback is underexplored in multi-turn agent settings, where it can take the form of error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities, and introduce SERL, a selective environment-reweighted learning framework. In SERL, the task reward determines the direction of each update, while environment feedback determines where the update is placed and how large it is, focusing learning on critical actions. On ALFWorld and WebShop, SERL achieves success rates of 90.0% and 80.1%, respectively, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.
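The split between update direction and update magnitude can be illustrated with a small sketch. This is not the paper's algorithm, whose details the abstract does not give. It is one possible reading, under these assumptions: each step has a scalar feedback score (how those scores would be derived from environment feedback is not specified here), the trajectory-level advantage is the task reward minus a baseline, and step weights come from a softmax over the scores, rescaled to average 1. The function name `reweighted_step_advantages` and the `temperature` parameter are hypothetical.

```python
import math
from typing import List


def reweighted_step_advantages(
    task_reward: float,
    baseline: float,
    feedback_scores: List[float],
    temperature: float = 1.0,
) -> List[float]:
    """Illustrative per-step advantages (an assumption, not SERL itself).

    The sign of every step's advantage comes only from the task reward
    relative to the baseline. The hypothetical feedback scores only
    redistribute magnitude across steps.
    """
    if not feedback_scores:
        return []

    # Direction: one trajectory-level signal shared by all steps.
    trajectory_advantage = task_reward - baseline

    # Magnitude: numerically stable softmax over per-step feedback scores.
    max_score = max(feedback_scores)
    exps = [math.exp((s - max_score) / temperature) for s in feedback_scores]
    total = sum(exps)

    # Rescale so the weights average to 1. Uniform scores then recover
    # the plain trajectory-level advantage at every step.
    n = len(feedback_scores)
    weights = [n * e / total for e in exps]

    # Weights are strictly positive, so feedback never flips the sign.
    return [trajectory_advantage * w for w in weights]


if __name__ == "__main__":
    # A successful 3-step episode where the second step scored highest.
    print(reweighted_step_advantages(1.0, 0.5, [0.0, 2.0, 0.0]))
```

Because the weights are strictly positive, a failed episode pushes every step's probability down and a successful one pushes it up. The feedback scores only decide which steps receive most of that push.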