AI RO | Dec 12, 2023

Sequential Planning in Large Partially Observable Environments guided by LLMs

arXiv:2312.07368v1 | 3 citations | h-index: 4

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of combinatorial explosion in planning for AI agents, though it appears to be an incremental improvement over existing LLM-based methods.

The paper tackles sequential planning in large partially observable environments by proposing a hybrid agent called "neoplanner" that combines state space search with LLM queries. The authors report a 124% improvement in average reward over the current best method in the Scienceworld environment.

Sequential planning in large state and action spaces quickly becomes intractable due to combinatorial explosion of the search space. Heuristic methods such as Monte Carlo tree search are effective for large state spaces but struggle when the action space is large. Pure reinforcement learning methods, which rely only on reward signals, need a prohibitively large number of interactions with the environment to devise a viable plan. If the state space, observations, and actions can be represented in natural language, then large language models (LLMs) can be used to generate action plans. Recently, several such goal-directed agents, including Reflexion, CLIN, and SayCan, have surpassed other state-of-the-art methods with minimal or no task-specific training. But they still struggle with exploration and get stuck in local optima. Their planning capabilities are bounded by the limited reasoning capability of the foundational LLMs on text data. We propose a hybrid agent, "neoplanner", that combines state space search with queries to a foundational LLM to get the best action plan. The reward signals are used quantitatively to drive the search. A balance of exploration and exploitation is maintained by maximizing upper confidence bounds on the values of states. In places where random exploration is needed, the LLM is queried to generate an action plan. Learnings from each trial are stored as entity relationships in text format and are used in future queries to the LLM for continual improvement. Experiments in the Scienceworld environment reveal a 124% improvement over the current best method in average reward across multiple tasks.
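The abstract does not give the exact selection rule, so the following is only a minimal sketch of the general idea: pick the next state by maximizing an upper confidence bound on its value, and fall back to an LLM query where nothing is known yet. It assumes the standard UCB1 formula (mean value plus c * sqrt(ln N / n)), which may differ from what the paper uses. The names `StateStats`, `ucb_score`, `select_next`, and the `llm_plan` callback are hypothetical and stand in for whatever the actual agent defines.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class StateStats:
    """Running statistics for one known state (hypothetical structure)."""
    visits: int = 0
    total_reward: float = 0.0

    @property
    def mean_value(self) -> float:
        # Average reward observed from this state; 0.0 if never visited.
        return self.total_reward / self.visits if self.visits else 0.0


def ucb_score(stats: StateStats, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """UCB1 score (an assumption): exploitation term plus exploration bonus."""
    if stats.visits == 0:
        return math.inf  # unvisited states are always tried first
    bonus = c * math.sqrt(math.log(parent_visits) / stats.visits)
    return stats.mean_value + bonus


def select_next(
    children: Dict[str, StateStats],
    llm_plan: Callable[[], List[str]],
    c: float = math.sqrt(2),
) -> Tuple[str, Union[str, List[str]]]:
    """Return ("search", state_id) or ("llm", action_plan)."""
    if not children:
        # No known successor states: ask the LLM (placeholder callback)
        # for an action plan instead of exploring at random.
        return ("llm", llm_plan())
    parent_visits = sum(s.visits for s in children.values())
    best = max(children, key=lambda sid: ucb_score(children[sid], parent_visits, c))
    return ("search", best)
```

With `c` set to 0 the rule is purely greedy on mean value; larger values of `c` favor rarely visited states. In a real agent, `llm_plan` would wrap a prompt that includes the stored entity-relationship notes from earlier trials, which this sketch leaves out.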
