Revisiting the Travel Planning Capabilities of Large Language Models
For researchers and developers of LLMs, this work provides a diagnostic framework to identify specific weaknesses in travel planning, enabling targeted improvements.
The paper decomposes travel planning into five atomic sub-capabilities and evaluates LLMs on each in isolation, finding that while LLMs excel at extracting explicit constraints, they struggle with implicit requirements, exhibit structural biases in plan generation, and suffer from ineffective self-correction.
Travel planning is a critical testbed for long-horizon reasoning, and it exposes significant deficits in LLMs. However, existing benchmarks primarily assess final plans end to end, an approach that lacks interpretability and makes it difficult to trace failures to their root causes. To bridge this gap, we decompose travel planning into five atomic sub-capabilities: \emph{Constraint Extraction}, \emph{Tool Use}, \emph{Plan Generation}, \emph{Error Identification}, and \emph{Error Correction}. We implement a decoupled evaluation protocol that supplies oracle intermediate contexts to each component, isolating it from the others and measuring its atomic performance boundary without the noise of cascading errors. Our results reveal a clear contrast: while LLMs are proficient at extracting explicit constraints, they struggle to infer implicit, open-world requirements. Furthermore, they exhibit structural biases in plan generation and suffer from ineffective self-correction, characterized by excessive sensitivity and erroneous persistence. These findings offer precise directions for improving LLM reasoning and planning abilities.
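To make the idea of decoupled evaluation concrete, the following is a minimal sketch, not the paper's implementation. It assumes each test example carries a gold ("oracle") upstream context for exactly one sub-capability, so that a model's score on one stage cannot be affected by its mistakes on an earlier stage. The names `StageExample`, `evaluate_decoupled`, `exact_match`, and `toy_model`, as well as the context format and the exact-match metric, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# The five sub-capabilities named in the abstract.
STAGES = (
    "constraint_extraction",
    "tool_use",
    "plan_generation",
    "error_identification",
    "error_correction",
)


@dataclass
class StageExample:
    """One isolated test case for a single stage (hypothetical format)."""
    stage: str           # one of STAGES
    oracle_context: Any  # gold upstream context, not the model's own output
    gold_output: Any     # reference answer for this stage only


Model = Callable[[str, Any], Any]    # (stage, context) -> prediction
Scorer = Callable[[Any, Any], float]  # (prediction, gold) -> score in [0, 1]


def exact_match(pred: Any, gold: Any) -> float:
    """Placeholder metric; a real study would use stage-specific metrics."""
    return 1.0 if pred == gold else 0.0


def evaluate_decoupled(
    model: Model,
    examples: List[StageExample],
    scorers: Dict[str, Scorer],
) -> Dict[str, float]:
    """Return the mean score per stage, feeding oracle context to every call."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for ex in examples:
        if ex.stage not in STAGES:
            raise ValueError(f"unknown stage: {ex.stage}")
        # The model only ever sees gold upstream context, so an error made
        # in one stage cannot cascade into the score of another stage.
        pred = model(ex.stage, ex.oracle_context)
        scorer = scorers.get(ex.stage, exact_match)
        totals[ex.stage] = totals.get(ex.stage, 0.0) + scorer(pred, ex.gold_output)
        counts[ex.stage] = counts.get(ex.stage, 0) + 1
    return {stage: totals[stage] / counts[stage] for stage in totals}


def toy_model(stage: str, context: Any) -> Any:
    """Stand-in for an LLM: copies an explicit constraint, fails elsewhere."""
    if stage == "constraint_extraction":
        return context["stated_budget"]
    return None


examples = [
    StageExample("constraint_extraction", {"stated_budget": 1500}, 1500),
    StageExample("plan_generation", {"constraints": {"budget": 1500}}, ["placeholder plan"]),
]
scores = evaluate_decoupled(toy_model, examples, scorers={})
print(scores)
```

In an end-to-end setup, the plan generator would instead receive whatever the extraction stage produced, so a low final score could not be attributed to a single stage. Scoring each stage against its own oracle context is what allows per-capability conclusions of the kind the abstract reports.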