AI | CL | Jan 26

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

arXiv:2601.18137v1 | 15 citations | h-index: 7 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of evaluating practical agent planning, which matters to researchers, but the contribution is incremental because it builds on existing benchmark efforts.

The authors address the lack of benchmarks for long-horizon agent planning with verifiable constraints by introducing DeepPlanning, a benchmark built on multi-day travel planning and multi-product shopping tasks. They find that even frontier agentic LLMs struggle with these problems, which highlights the importance of reliable explicit reasoning patterns and parallel tool use.

While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.

