Exploring and Benchmarking the Planning Capabilities of Large Language Models
This work addresses the planning capabilities of LLMs, a domain-specific problem for AI researchers. Its contribution is incremental: it builds on existing methods and adds new benchmarks and evaluations.
The paper tackles the difficulty of planning tasks for large language models (LLMs) by constructing a comprehensive benchmark suite and evaluating methods such as many-shot in-context learning and fine-tuning to improve performance, with systematic evaluation across difficulty levels and tests of out-of-distribution generalization.
Classical and natural language planning tasks remain a difficult domain for modern large language models (LLMs). In this work, we lay the foundations for improving the planning capabilities of LLMs. First, we construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. This suite includes algorithms to methodically generate instances of tasks with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Next, we investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance. In addition, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths. We also probe the efficacy of chain-of-thought reasoning methods to improve LLM planning performance. Moreover, we evaluate the proposed methods in out-of-distribution scenarios, assessing their ability to generalize to novel and unseen planning challenges. Finally, we investigate the models' failure modes and reveal insights that hold true across different benchmarks.
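As a rough illustration of two ideas in the abstract, generating task instances at a controllable difficulty and assembling a many-shot prompt from optimal plans, the sketch below uses a toy grid-navigation domain. The domain, the difficulty knobs (`size`, `wall_fraction`), and all function names are assumptions made for this example; they are not the benchmark suite or the generators described in the paper.

```python
import random
from collections import deque

# Hypothetical toy domain: move an agent on a square grid from start to goal.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def generate_instance(size, wall_fraction, rng):
    """Sample one task; larger grids and more walls are assumed to be harder."""
    cells = [(r, c) for r in range(size) for c in range(size)]
    start, goal = rng.sample(cells, 2)  # two distinct cells
    free = [cell for cell in cells if cell not in (start, goal)]
    walls = set(rng.sample(free, int(wall_fraction * len(free))))
    return {"size": size, "start": start, "goal": goal, "walls": walls}


def optimal_plan(task):
    """Breadth-first search: returns a shortest move list, or None if unsolvable."""
    size, walls = task["size"], task["walls"]
    queue = deque([(task["start"], [])])
    seen = {task["start"]}
    while queue:
        (r, c), plan = queue.popleft()
        if (r, c) == task["goal"]:
            return plan
        for name, (dr, dc) in MOVES.items():
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if inside and nxt not in walls and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [name]))
    return None


def solvable_instance(size, wall_fraction, rng):
    """Resample until the task has a solution; return the task and its plan."""
    while True:
        task = generate_instance(size, wall_fraction, rng)
        plan = optimal_plan(task)
        if plan is not None:
            return task, plan


def describe(task):
    """Render a task as plain text for use inside a prompt."""
    return (f"Grid {task['size']}x{task['size']}, start {task['start']}, "
            f"goal {task['goal']}, walls {sorted(task['walls'])}.")


def many_shot_prompt(num_shots, size, wall_fraction, rng):
    """Build a prompt with num_shots solved examples followed by one open query."""
    shots = []
    for _ in range(num_shots):
        task, plan = solvable_instance(size, wall_fraction, rng)
        shots.append(f"Task: {describe(task)}\nPlan: {' '.join(plan)}")
    query, reference = solvable_instance(size, wall_fraction, rng)
    prompt = "\n\n".join(shots + [f"Task: {describe(query)}\nPlan:"])
    return prompt, reference


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled tasks are reproducible
    prompt, reference = many_shot_prompt(num_shots=3, size=4, wall_fraction=0.2, rng=rng)
    print(prompt)
    print("Reference plan length:", len(reference))
```

Raising `num_shots` lengthens the context in the way a many-shot study would, and sweeping `size` or `wall_fraction` gives a simple difficulty axis. The BFS reference plan could serve either as a fine-tuning target or as a yardstick for checking whether a model's plan is optimal.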