LG | Sep 25, 2025

MMPlanner: Zero-Shot Multimodal Procedural Planning with Chain-of-Thought Object State Reasoning

arXiv:2509.21662v1 | 1 citation | h-index: 18 | EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses multimodal procedural planning for applications such as recipes or wiki-style instructions, with incremental advances in evaluation and prompting.

The paper tackles the challenge of generating step-by-step multimodal instructions that keep object states consistent across text and images. It reports state-of-the-art results on RECIPEPLAN and WIKIPLAN, with gains of +6.8% in textual planning, +11.9% in cross-modal alignment, and +26.7% in visual step ordering.

Multimodal Procedural Planning (MPP) aims to generate step-by-step instructions that combine text and images, with the central challenge of preserving object-state consistency across modalities while producing informative plans. Existing approaches often leverage large language models (LLMs) to refine textual steps; however, visual object-state alignment and systematic evaluation are largely underexplored. We present MMPlanner, a zero-shot MPP framework that introduces Object State Reasoning Chain-of-Thought (OSR-CoT) prompting to explicitly model object-state transitions and generate accurate multimodal plans. To assess plan quality, we design LLM-as-a-judge protocols for planning accuracy and cross-modal alignment, and further propose a visual step-reordering task to measure temporal coherence. Experiments on RECIPEPLAN and WIKIPLAN show that MMPlanner achieves state-of-the-art performance, improving textual planning by +6.8%, cross-modal alignment by +11.9%, and visual step ordering by +26.7%.
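The abstract does not say how the visual step-reordering task is scored. As a rough illustration only, the sketch below assumes a predicted ordering of step images is compared against the reference ordering using two generic measures: exact match and pairwise order accuracy. The function names and the choice of metrics are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_order_accuracy(predicted, reference):
    """Fraction of step pairs whose relative order in `predicted`
    agrees with `reference`. Steps are assumed to be unique IDs.
    This is an illustrative metric, not the paper's stated one."""
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted must be a permutation of reference")
    # Map each step ID to its position in the predicted ordering.
    position = {step: i for i, step in enumerate(predicted)}
    # combinations() yields pairs (a, b) with a before b in the reference.
    pairs = list(combinations(reference, 2))
    if not pairs:
        return 1.0  # zero or one step: trivially ordered
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


def exact_match(predicted, reference):
    """True only if the whole predicted ordering equals the reference."""
    return list(predicted) == list(reference)


if __name__ == "__main__":
    reference = [0, 1, 2, 3]   # ground-truth order of step images
    predicted = [0, 2, 1, 3]   # hypothetical model output: steps 1 and 2 swapped
    print(exact_match(predicted, reference))
    print(pairwise_order_accuracy(predicted, reference))
```

Pairwise accuracy gives partial credit when most steps are in the right relative order, whereas exact match only rewards a fully correct sequence; which of these (if either) matches the paper's protocol would need to be checked against the paper itself.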

