LG · Sep 25, 2025

MMPlanner: Zero-Shot Multimodal Procedural Planning with Chain-of-Thought Object State Reasoning

Afrina Tabassum, Bin Guo, Xiyao Ma, Hoda Eldardiry, Ismini Lourentzou

arXiv:2509.21662v1 · 1 citation · h-index: 18 · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses multimodal procedural planning for instructions such as recipes and wiki-style guides, contributing incremental advances in prompting and evaluation.

The paper tackles the challenge of generating step-by-step multimodal instructions that keep object states consistent across text and images, reporting state-of-the-art results with gains of +6.8% in textual planning, +11.9% in cross-modal alignment, and +26.7% in visual step ordering.

Multimodal Procedural Planning (MPP) aims to generate step-by-step instructions that combine text and images, with the central challenge of preserving object-state consistency across modalities while producing informative plans. Existing approaches often leverage large language models (LLMs) to refine textual steps; however, visual object-state alignment and systematic evaluation are largely underexplored. We present MMPlanner, a zero-shot MPP framework that introduces Object State Reasoning Chain-of-Thought (OSR-CoT) prompting to explicitly model object-state transitions and generate accurate multimodal plans. To assess plan quality, we design LLM-as-a-judge protocols for planning accuracy and cross-modal alignment, and further propose a visual step-reordering task to measure temporal coherence. Experiments on RECIPEPLAN and WIKIPLAN show that MMPlanner achieves state-of-the-art performance, improving textual planning by +6.8%, cross-modal alignment by +11.9%, and visual step ordering by +26.7%.
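The abstract does not say how the visual step-reordering task is scored. As an illustration only, the minimal sketch below shows one plausible way to quantify temporal coherence: the fraction of step pairs whose relative order in a predicted sequence matches the reference order. The function name `pairwise_order_accuracy` and the choice of a pairwise-agreement metric are assumptions made for this example, not the paper's stated protocol.

```python
from itertools import combinations


def pairwise_order_accuracy(predicted, reference):
    """Fraction of step pairs ordered the same way in both sequences.

    Hypothetical metric for illustration; the paper's actual scoring
    may differ. Steps are assumed to be unique, hashable identifiers
    (e.g. image indices).
    """
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted and reference must contain the same steps")
    if len(reference) < 2:
        # Zero or one step has no pairs to misorder.
        return 1.0

    # Position of each step in the predicted ordering.
    position = {step: i for i, step in enumerate(predicted)}

    # combinations() yields (a, b) with a preceding b in the reference.
    pairs = list(combinations(reference, 2))
    agreeing = sum(1 for a, b in pairs if position[a] < position[b])
    return agreeing / len(pairs)
```

Under this sketch, a perfectly ordered sequence scores 1.0 and a fully reversed one scores 0.0; swapping two adjacent steps in a four-step plan leaves five of the six pairs correct, giving 5/6.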

