LG · AI · May 29

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

arXiv:2605.30719 · 93.0 · h-index: 1

Predicted impact: top 6% in LG · last 90 days · Originality: Highly original

AI Analysis

This work addresses the problem of using LLMs for policy optimization in sequential RL tasks, offering a potentially more efficient alternative for practitioners and researchers in RL.

This paper investigates the effectiveness of Large Language Models (LLMs) as black-box policy optimizers for Reinforcement Learning (RL) tasks. The proposed method, Prompted Policy Optimization (PromptPO), iteratively prompts an LLM to generate and refine executable policies based on rollout feedback, often matching or exceeding standard RL baselines with substantially fewer environment interactions across various tasks.

We study when large language models (LLMs) can serve as effective black-box policy optimizers for reinforcement learning (RL) tasks, i.e., when can we replace classical RL algorithms with an LLM? We explore this question by introducing Prompted Policy Optimization (PromptPO), an iterative method that prompts an LLM with Python descriptions of the state space, action space, and reward function, then has it generate and refine executable policies based on rollout feedback. Across hard exploration environments, Meta-World robotics tasks, and several real-world control problems, PromptPO often matches or exceeds the performance of standard RL baselines while using substantially fewer environment interactions. To maximize expected return, and without further explicit prompting, the policies PromptPO outputs range from tuned proportional controllers or rule-based plans to policies that run planning algorithms like value iteration. Our results demonstrate that LLM-based policy optimization is sufficient when the LLM can leverage prior knowledge about the environment or optimization strategy. However, PromptPO underperforms standard RL baselines in MuJoCo domains. This demonstrates possible limitations of LLM-based policy optimization in settings that require fine-grained continuous control.
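The generate, roll out, and refine loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy 1-D reaching task, the `stub_llm` function (a hypothetical stand-in that returns proportional controllers with increasing gain instead of calling a real model), and all other names are assumptions made for illustration only.

```python
import random

# Hypothetical Python-style description of the task that would go in the prompt.
ENV_DESCRIPTION = """
# State: (position, target), both floats. Action: float clipped to [-1, 1].
# Step: position += action. Reward: -abs(target - position).
"""

def rollout(policy, horizon=20, seed=0):
    """Run one episode of a toy 1-D reaching task and return the total reward."""
    rng = random.Random(seed)
    position, target = 0.0, rng.uniform(-5.0, 5.0)
    total = 0.0
    for _ in range(horizon):
        action = max(-1.0, min(1.0, policy((position, target))))
        position += action
        total += -abs(target - position)
    return total

def stub_llm(prompt, history):
    """Hypothetical stand-in for an LLM call; returns policy source code.

    A real system would send `prompt` and the rollout `history` to a model.
    Here each call simply proposes a proportional controller with a larger gain.
    """
    gain = round(0.1 * (len(history) + 1), 2)
    return (
        "def policy(state):\n"
        "    position, target = state\n"
        f"    return {gain} * (target - position)\n"
    )

def compile_policy(source):
    """Turn generated source code into a callable policy (trusted input only)."""
    namespace = {}
    exec(source, namespace)
    return namespace["policy"]

def prompted_policy_search(llm, iterations=5, episodes=5):
    """Iteratively request a policy, evaluate it, and feed the result back."""
    history = []  # (source, mean_return) pairs shown to the LLM as feedback
    best_source, best_return = None, float("-inf")
    for _ in range(iterations):
        source = llm(ENV_DESCRIPTION, history)
        policy = compile_policy(source)
        mean_return = sum(rollout(policy, seed=s) for s in range(episodes)) / episodes
        history.append((source, mean_return))
        if mean_return > best_return:
            best_source, best_return = source, mean_return
    return best_source, best_return, history

if __name__ == "__main__":
    source, score, _ = prompted_policy_search(stub_llm)
    print(f"best mean return: {score:.2f}")
    print(source)
```

In this sketch the only environment interactions are the evaluation rollouts, which is the quantity the abstract compares against standard RL baselines. Executing generated code with `exec` is unsafe for untrusted output, so a real setup would presumably sandbox it.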
