CL · AI · LG · Feb 16, 2024

When is Tree Search Useful for LLM Planning? It Depends on the Discriminator

Microsoft
arXiv:2402.10890v2 · 63 citations · h-index: 42 · Has Code · ACL
Originality: Synthesis-oriented
AI Analysis

This work examines the practical deployment challenges of LLM-based planning methods, offering researchers and developers incremental insights into where the performance bottlenecks lie.

The paper investigates how useful two advanced planning methods, iterative correction and tree search, are for LLMs on multi-step problems. It finds that these methods need discriminators with at least 90% accuracy to significantly outperform simpler re-ranking. Current LLM discriminators fall short of that threshold, so the gains are negligible while the cost is high: tree search is at least 10-20 times slower than the other two methods.

In this paper, we examine how large language models (LLMs) solve multi-step problems under a language agent framework with three components: a generator, a discriminator, and a planning method. We investigate the practical utility of two advanced planning methods, iterative correction and tree search. We present a comprehensive analysis of how discrimination accuracy affects the overall performance of agents when using these two methods or a simpler method, re-ranking. Experiments on two tasks, text-to-SQL parsing and mathematical reasoning, show that: (1) advanced planning methods demand discriminators with at least 90% accuracy to achieve significant improvements over re-ranking; (2) current LLMs' discrimination abilities have not met the needs of advanced planning methods to achieve such improvements; (3) with LLM-based discriminators, advanced planning methods may not adequately balance accuracy and efficiency. For example, compared to the other two methods, tree search is at least 10-20 times slower but leads to negligible performance gains, which hinders its real-world applications. Code and data are available at https://github.com/OSU-NLP-Group/llm-planning-eval.
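To make the generator / discriminator / planning-method split concrete, here is a minimal sketch of re-ranking with a simulated discriminator whose accuracy can be dialed up or down. It is not taken from the paper's repository: the function names, the candidate representation, and parameters such as `p_correct` and `n_candidates` are all hypothetical, and the "generator" is just a random stand-in that produces correct plans with a fixed probability.

```python
import random


def simulated_discriminator(is_correct, accuracy, rng):
    """Hypothetical discriminator: its 0/1 verdict matches the
    true label with probability `accuracy`, and is flipped otherwise."""
    if rng.random() < accuracy:
        return 1.0 if is_correct else 0.0
    return 0.0 if is_correct else 1.0


def rerank(candidates, accuracy, rng):
    """Re-ranking: score every candidate once, return the top-scoring one.
    Ties are broken in favour of the earliest-generated candidate."""
    scores = [simulated_discriminator(ok, accuracy, rng) for _, ok in candidates]
    best_index = scores.index(max(scores))
    return candidates[best_index]


def success_rate(accuracy, trials=2000, n_candidates=5, p_correct=0.3, seed=0):
    """Fraction of trials in which re-ranking returns a correct plan.
    The generator is a stand-in: each plan is correct with prob. `p_correct`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        candidates = [(f"plan-{i}", rng.random() < p_correct)
                      for i in range(n_candidates)]
        _, ok = rerank(candidates, accuracy, rng)
        hits += ok
    return hits / trials


if __name__ == "__main__":
    for acc in (0.5, 0.7, 0.9, 1.0):
        print(f"discriminator accuracy {acc:.1f} -> success {success_rate(acc):.3f}")
```

Sweeping `accuracy` in this toy setup shows how the end-to-end success rate of even the simplest planning method is tied to the discriminator. The 90% threshold reported in the abstract comes from the paper's own experiments, not from this simulation.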

Code Implementations: 1 repo