AIJun 18, 2024

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

arXiv:2406.12172v35.82 citations

Originality Incremental advance

AI Analysis

This addresses a critical limitation in LLMs' reasoning abilities for AI researchers, though it is incremental as it builds on existing in-context learning techniques.

The paper tackles the problem of LLMs struggling with search problems requiring backtracking, introducing a new benchmark called SearchBench, and shows that advanced models like GPT-4 initially solve only 1.4% of problems, but performance improves to over 57% with a multi-stage inference method.

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, which contains 11 unique search problems inspired by intuitive puzzles. Each SearchBench problem type is equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that using step-by-step, language-only reasoning, even the most advanced LLMs fail to solve SearchBench; for example, OpenAI's frontier models GPT-4 and o1-preview solve only 1.4% and 18.6% of problems, respectively. The reason is that SearchBench problems require considering multiple pathways and performing backtracking, posing a significant challenge to auto-regressive models. Interestingly, performance is significantly boosted when we prompt models to generate a complete A* search algorithm - a comparatively more cognitively difficult task. This approach effectively offloads the iterative search and backtracking process from the models, which they struggle with in text. This in-context learning baseline is further enhanced via a Multi-Stage-Multi-Try (MSMT) inference method, increasing GPT-4's rate of correct solutions to over 57%.

View on arXiv PDF

Similar