ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
This work fills a gap for AI and robotics researchers who need to evaluate visual planning approaches, though its contribution is incremental: it provides a benchmark rather than a new method.
The paper addresses the lack of a common benchmark for comparing visual planning methods built on Vision-Language Models (VLMs). It introduces ViPlan, an open-source benchmark with tasks in Blocksworld and household robotics. Symbolic planning outperforms direct VLM planning in Blocksworld, while direct VLM planning does better in household robotics, and Chain-of-Thought prompting yields no significant benefit across most models.
Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.
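To make the idea of VLM-grounded symbolic planning concrete, the following is a minimal sketch under stated assumptions, not ViPlan's actual code or API. It assumes a Blocksworld state described by "x is on y" predicates, replaces the VLM with a stub that answers yes/no questions from a known scene, and uses a plain breadth-first search where a real system would use a classical planner. All names (make_stub_vlm, ground_state, plan) are hypothetical.

```python
from collections import deque


def make_stub_vlm(true_scene):
    """Stand-in for a VLM: answers 'is x on y?' from a known scene (assumption)."""
    def ask(x, y):
        return (x, y) in true_scene
    return ask


def ground_state(ask, blocks):
    """Build a symbolic state by asking one yes/no question per candidate predicate."""
    places = list(blocks) + ["table"]
    return frozenset((x, y) for x in blocks for y in places if x != y and ask(x, y))


def successors(state):
    """Yield (action, next_state) pairs for moving any clear block."""
    support = dict(state)  # block -> what it rests on
    clear = [b for b in support if b not in support.values()]
    for x in clear:
        for dest in clear + ["table"]:
            if dest != x and dest != support[x]:
                nxt = dict(support)
                nxt[x] = dest
                yield ("move", x, dest), frozenset(nxt.items())


def plan(start, goal):
    """Breadth-first search for a shortest action sequence that satisfies the goal."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if goal <= state:  # every goal predicate holds in this state
            return actions
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None


# Hypothetical scene: c sits on a; a and b sit on the table.
scene = {("c", "a"), ("a", "table"), ("b", "table")}
start = ground_state(make_stub_vlm(scene), ["a", "b", "c"])
goal = frozenset({("a", "b"), ("b", "c")})
found = plan(start, goal)
print(found)
```

In this sketch the planner only ever sees the predicates returned by the grounding step, so a single wrong yes/no answer produces a wrong symbolic state and, in turn, an invalid or suboptimal plan. This is one way to read the abstract's remark that accurate image grounding is crucial in Blocksworld.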