LG | CL | Jun 24, 2025

Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models

CMU
arXiv:2506.20061v1 | 1 citation | h-index: 8
Originality: Highly original
AI Analysis

The paper addresses data inefficiency and the human annotation burden in reinforcement learning for instruction-following agents. Its contribution is incremental in that it builds on existing LLM and instruction-following methods.

The paper tackles the challenge of learning instruction-following policies in reinforcement learning by using large language models to automatically generate open-ended instructions from agent trajectories, reducing reliance on human-labeled data. It demonstrates improvements in sample efficiency, instruction coverage, and policy performance in the Craftax environment compared to state-of-the-art baselines.

Developing effective instruction-following policies in reinforcement learning remains challenging due to the reliance on extensive human-labeled instruction datasets and the difficulty of learning from sparse rewards. In this paper, we propose a novel approach that leverages the capabilities of large language models (LLMs) to automatically generate open-ended instructions retrospectively from previously collected agent trajectories. Our core idea is to employ LLMs to relabel unsuccessful trajectories by identifying meaningful subtasks the agent has implicitly accomplished, thereby enriching the agent's training data and substantially alleviating reliance on human annotations. Through this open-ended instruction relabeling, we efficiently learn a unified instruction-following policy capable of handling diverse tasks within a single policy. We empirically evaluate our proposed method in the challenging Craftax environment, demonstrating clear improvements in sample efficiency, instruction coverage, and overall policy performance compared to state-of-the-art baselines. Our results highlight the effectiveness of utilizing LLM-guided open-ended instruction relabeling to enhance instruction-following reinforcement learning.
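The relabeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` fields, the `toy_relabeler` function (a hard-coded stand-in for an LLM call), the event strings, and the choice to keep the failed episode as a zero-reward example are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    instruction: str   # instruction the agent was originally given
    events: List[str]  # textual summary of what happened during the episode
    success: bool      # whether the original instruction was satisfied


def toy_relabeler(events: List[str]) -> List[str]:
    """Hypothetical stand-in for an LLM: names subtasks implied by the events.

    A real system would prompt a language model with the trajectory summary.
    """
    known = {
        "collected wood": "collect wood",
        "placed table": "place a crafting table",
    }
    return [known[e] for e in events if e in known]


def relabel(
    trajectories: List[Trajectory],
    relabeler: Callable[[List[str]], List[str]],
) -> List[Tuple[str, List[str], float]]:
    """Build (instruction, events, reward) training tuples.

    Successful episodes are kept as they are. Failed episodes are kept with
    reward 0 for the original instruction (an assumption of this sketch) and
    duplicated with reward 1 for every subtask the relabeler identifies.
    """
    data = []
    for traj in trajectories:
        if traj.success:
            data.append((traj.instruction, traj.events, 1.0))
            continue
        data.append((traj.instruction, traj.events, 0.0))
        for new_instruction in relabeler(traj.events):
            data.append((new_instruction, traj.events, 1.0))
    return data


if __name__ == "__main__":
    failed = Trajectory(
        instruction="craft a pickaxe",
        events=["collected wood", "placed table", "took damage"],
        success=False,
    )
    for row in relabel([failed], toy_relabeler):
        print(row[0], row[2])
```

In this sketch a single failed episode yields three training tuples: one for the original instruction and one for each subtask the relabeler recognizes. That is how relabeling can turn sparse-reward failures into additional positive examples.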

