LG AI Feb 20

MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance

arXiv:2602.17930v1 1.4 h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the scalability and reliability issues of LLM-guided RL in sparse-reward environments. It is an incremental advance, since it builds on existing LLM-guided RL methods.

The paper tackles the high sample complexity of reinforcement learning agents in sparse or delayed reward settings. It proposes MIRA, a memory-integrated agent that uses a structured memory graph to guide training with limited LLM guidance. MIRA achieves returns comparable to methods with frequent LLM supervision while requiring substantially fewer online LLM queries.

Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision-relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent's high-return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real-time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent's policy gradually surpasses the initial LLM-derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: https://narjesno.github.io/MIRA/
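
The abstract describes a utility signal that softly adjusts advantage estimates and decays over training, but it does not give the exact formula or schedule. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the utility enters as an additive term with an exponentially decaying weight. The names `decayed_weight`, `shaped_advantage`, `beta0`, and `decay_rate` are hypothetical, and the utility values are taken as given rather than derived from a memory graph.

```python
import numpy as np


def decayed_weight(step, beta0=1.0, decay_rate=0.01):
    """Hypothetical schedule: the utility weight shrinks exponentially with the training step."""
    return beta0 * np.exp(-decay_rate * step)


def shaped_advantage(advantages, utilities, step, beta0=1.0, decay_rate=0.01):
    """Add a decaying utility term to the advantage estimates.

    Assumption: additive shaping, A'(s, a) = A(s, a) + beta_t * U(s, a).
    The environment reward is left untouched; only the advantage used
    in the policy update changes.
    """
    advantages = np.asarray(advantages, dtype=float)
    utilities = np.asarray(utilities, dtype=float)
    return advantages + decayed_weight(step, beta0, decay_rate) * utilities


# Early in training the utility term has full weight.
early = shaped_advantage([0.5, -0.1], [0.2, 0.0], step=0)

# Late in training the weight is negligible, so the standard advantage is recovered.
late = shaped_advantage([0.5, -0.1], [0.2, 0.0], step=10_000)
```

Under this assumed schedule, the shaped advantage approaches the unshaped one as the step count grows, which mirrors the abstract's claim that the utility term decays while the reward function itself stays unmodified.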
