AI, CL, LG · Dec 29, 2025

Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following

arXiv:2512.23457v1 · 4 citations · h-index: 12 · Has Code
Originality Highly original
AI Analysis

This work addresses sample inefficiency in RL for instruction-following tasks, offering a method to improve training efficiency for AI alignment applications.

The paper tackles the problem of sparse rewards in reinforcement learning for aligning large language models to follow complex instructions, by proposing a hindsight instruction replay method that converts failed attempts into successful ones, achieving promising results with reduced computational costs.

Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction-following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction and response levels to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction-following tasks while requiring a smaller computational budget. Our code and dataset are available at https://github.com/sastpg/HIR.
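The select-then-rewrite idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Constraint`, `Sample`, `build_instruction`, `hindsight_replay`, and the `min_satisfied` threshold are all hypothetical, and the selection rule (replay any failed response that satisfies at least one constraint) is an assumption made only for illustration. The sketch assumes each constraint can be verified by a simple checker function and that the reward is binary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Constraint:
    """A single verifiable constraint (hypothetical structure)."""
    text: str                      # how the constraint is phrased in the prompt
    check: Callable[[str], bool]   # returns True if a response satisfies it


@dataclass
class Sample:
    """One training sample with a binary reward."""
    instruction: str
    response: str
    reward: int


def build_instruction(task: str, constraints: List[Constraint]) -> str:
    """Join the base task and the constraint texts into one instruction."""
    return " ".join([task] + [c.text for c in constraints])


def hindsight_replay(
    task: str,
    constraints: List[Constraint],
    responses: List[str],
    min_satisfied: int = 1,  # assumed selection threshold, not from the paper
) -> List[Sample]:
    """Return original samples plus replayed ones for partially successful failures."""
    full_instruction = build_instruction(task, constraints)
    samples: List[Sample] = []
    for response in responses:
        satisfied = [c for c in constraints if c.check(response)]
        success = len(satisfied) == len(constraints)
        # Original sample: reward 1 only if every constraint holds.
        samples.append(Sample(full_instruction, response, int(success)))
        # Select: keep failed responses that satisfy enough constraints.
        # Rewrite: pair them with an instruction listing only those constraints,
        # so the same response counts as a success in hindsight.
        if not success and len(satisfied) >= min_satisfied:
            rewritten = build_instruction(task, satisfied)
            samples.append(Sample(rewritten, response, 1))
    return samples
```

Under these assumptions, a response that meets one of two constraints yields two samples: a failure against the full instruction and a success against the rewritten one. Responses that satisfy nothing are left as plain failures.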

