CL Oct 5, 2023

Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning

arXiv:2310.03249v312.762 citations h-index: 26 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the limitation of LLMs in long-term planning and spatial reasoning for AI and robotics applications, but it is incremental as it builds on existing benchmarking and prompting methods.

The paper evaluates large language models (LLMs) on spatial-temporal reasoning for path planning tasks and proposes the PPNL benchmark for that purpose. Few-shot GPT-4 shows promise in spatial reasoning when prompted to interleave reasoning and acting, but it still fails at long-term temporal reasoning. Fine-tuned models (BART and T5) achieve high in-distribution results but struggle to generalize to larger environments or environments with more obstacles.

Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks; however, they still face limitations in scenarios that demand long-term planning and spatial reasoning. To facilitate this line of research, in this work, we propose a new benchmark, termed Path Planning from Natural Language (PPNL). Our benchmark evaluates LLMs' spatial-temporal reasoning by formulating "path planning" tasks that require an LLM to navigate to target locations while avoiding obstacles and adhering to constraints. Leveraging this benchmark, we systematically investigate LLMs including GPT-4 via different few-shot prompting methodologies as well as BART and T5 of various sizes via fine-tuning. Our experimental results show the promise of few-shot GPT-4 in spatial reasoning, when it is prompted to reason and act interleavedly, although it still fails to perform long-term temporal reasoning. In contrast, while fine-tuned LLMs achieved impressive results on in-distribution reasoning tasks, they struggled to generalize to larger environments or environments with more obstacles.
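To make the task description concrete, here is a minimal sketch of how a path proposed by a model might be checked on a square grid. It is an illustration only and is not taken from the PPNL code: the grid encoding, the action names in `MOVES`, and the function `is_valid_path` are all hypothetical assumptions.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]  # (row, col); an assumed encoding, not the benchmark's

# Hypothetical action vocabulary; the benchmark's real one may differ.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_valid_path(
    size: int,
    obstacles: Set[Cell],
    start: Cell,
    goal: Cell,
    actions: Iterable[str],
) -> bool:
    """Return True if the actions lead from start to goal on a size x size
    grid without leaving the grid or stepping onto an obstacle."""
    row, col = start
    for action in actions:
        if action not in MOVES:
            return False  # unknown action
        d_row, d_col = MOVES[action]
        row, col = row + d_row, col + d_col
        if not (0 <= row < size and 0 <= col < size):
            return False  # stepped off the grid
        if (row, col) in obstacles:
            return False  # collided with an obstacle
    return (row, col) == goal


if __name__ == "__main__":
    blocked = {(1, 1)}
    print(is_valid_path(3, blocked, (0, 0), (2, 2), ["right", "right", "down", "down"]))
    print(is_valid_path(3, blocked, (0, 0), (2, 2), ["down", "right", "right", "down"]))
```

In this sketch the first call succeeds because the path goes around the blocked cell, while the second fails because its second step lands on the obstacle at (1, 1). A checker of this kind only verifies a single start-to-goal path; the constraints and multi-goal settings mentioned in the abstract would need additional logic.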
