LG, AI | Sep 7, 2024

Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework

Yongxin Deng, Xihe Qiu, Jue Chen, Xiaoyu Tan

arXiv:2409.04744v39.212 citations | h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses sample inefficiency in reinforcement learning for domains like robotics by integrating LLMs into reward tuning. The contribution is incremental, since it combines existing techniques.

The paper tackles the challenge of balancing exploration and exploitation in reinforcement learning, especially in sparse-reward environments such as robotic control. It proposes the LMGT framework, which uses large language models to guide reward tuning, and reports improved sample efficiency and reduced computational cost compared to baselines.

The inherent uncertainty in the environmental transition model of Reinforcement Learning (RL) necessitates a delicate balance between exploration and exploitation. This balance is crucial for optimizing computational resources to accurately estimate expected rewards for the agent. In scenarios with sparse rewards, such as robotic control systems, achieving this balance is particularly challenging. However, given that many environments possess extensive prior knowledge, learning from the ground up in such contexts may be redundant. To address this issue, we propose Language Model Guided reward Tuning (LMGT), a novel, sample-efficient framework. LMGT leverages the comprehensive prior knowledge embedded in Large Language Models (LLMs) and their proficiency in processing non-standard data forms, such as wiki tutorials. By utilizing LLM-guided reward shifts, LMGT adeptly balances exploration and exploitation, thereby guiding the agent's exploratory behavior and enhancing sample efficiency. We have evaluated LMGT across various RL tasks and in the embodied robotic environment Housekeep. Our results demonstrate that LMGT consistently outperforms baseline methods. Furthermore, the findings suggest that our framework can substantially reduce the computational resources required during the RL training phase.
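
The idea of an LLM-guided reward shift can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the shift is computed or applied, so the additive form, the `scale` parameter, the function names, and the `toy_evaluator` (a hard-coded stand-in for an LLM call) are all assumptions made here for illustration. The sketch adds an evaluator-supplied shift to the environment reward before a standard tabular Q-learning update.

```python
from typing import Callable, Dict, Hashable, Tuple

# Hypothetical evaluator type: maps (state, action) to a reward shift.
# In an LMGT-style setup this role would be played by an LLM; here it is a stub.
Evaluator = Callable[[Hashable, Hashable], float]


def toy_evaluator(state: Hashable, action: Hashable) -> float:
    """Stand-in for prior knowledge: assume action 1 is judged helpful."""
    return 1.0 if action == 1 else -1.0


def shifted_reward(env_reward: float, state: Hashable, action: Hashable,
                   evaluator: Evaluator, scale: float = 1.0) -> float:
    """Add the evaluator's shift to the environment reward (assumed additive form)."""
    return env_reward + scale * evaluator(state, action)


def q_update(q: Dict[Tuple[Hashable, Hashable], float], state: Hashable,
             action: Hashable, reward: float, next_state: Hashable,
             n_actions: int, alpha: float = 0.5, gamma: float = 0.9) -> float:
    """One standard tabular Q-learning update; unseen entries default to 0."""
    best_next = max(q.get((next_state, a), 0.0) for a in range(n_actions))
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


if __name__ == "__main__":
    q: Dict[Tuple[Hashable, Hashable], float] = {}
    # Sparse environment reward of 0, shifted by the evaluator's judgment.
    r = shifted_reward(0.0, state=0, action=1, evaluator=toy_evaluator)
    print(q_update(q, state=0, action=1, reward=r, next_state=1, n_actions=2))
```

In this sketch the shift gives the agent a learning signal even when the environment reward is zero, which is the situation the abstract describes for sparse-reward tasks. How the shift should be scaled or scheduled is left open here.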
