Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
For researchers and practitioners in vision-language reasoning, this work addresses the underexplored task of procedural QA with a new benchmark and a method that yields gains of up to 13% absolute across six VLMs.
The paper introduces ProcedureVQA, a benchmark for visual procedure question answering, and proposes Chain-of-Procedure (CoP), a hierarchical reasoning framework that improves next-step prediction by up to 13% absolute over standard baselines across six VLMs.
Recent vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. In VP-QA, users upload images of intermediate states of a complex procedure and ask which action to take next, which poses unique challenges. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark designed specifically for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between the granularity of image sequences and that of textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then refines the steps through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP's effectiveness, with up to 13% absolute improvement over standard baselines.
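The three stages named above (retrieve, refine, generate) can be illustrated with a toy sketch. This is not the paper's implementation: it assumes each image has already been reduced to a text caption, and it substitutes word overlap for cross-modal retrieval and rule-based splitting for semantic decomposition. All names (`Procedure`, `retrieve`, `refine`, `next_step`) and the example data are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Procedure:
    title: str
    steps: list[str]


def tokens(text: str) -> set[str]:
    """Lowercased alphabetic word set; a crude stand-in for learned embeddings."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(a: str, b: str) -> int:
    """Number of shared words between two strings."""
    return len(tokens(a) & tokens(b))


def retrieve(state_captions: list[str], library: list[Procedure]) -> Procedure:
    # Stage 1 (assumption): choose the procedure whose text best matches
    # the captions of the observed visual states.
    query = " ".join(state_captions)
    return max(
        library,
        key=lambda p: overlap(query, p.title + " " + " ".join(p.steps)),
    )


def refine(steps: list[str]) -> list[str]:
    # Stage 2 (assumption): split compound steps into finer sub-steps so the
    # text granularity is closer to one sub-step per image.
    fine = []
    for step in steps:
        parts = re.split(r",\s*then\s+|\s+and then\s+|;\s*", step)
        fine.extend(p.strip().rstrip(".") for p in parts if p.strip())
    return fine


def next_step(state_captions: list[str], library: list[Procedure]):
    # Stage 3 (assumption): locate the sub-step matching the latest state
    # and return the one that follows it, or None at the end.
    fine = refine(retrieve(state_captions, library).steps)
    latest = state_captions[-1]
    idx = max(range(len(fine)), key=lambda i: overlap(latest, fine[i]))
    return fine[idx + 1] if idx + 1 < len(fine) else None


# Hypothetical example data, for illustration only.
library = [
    Procedure(
        "Brew pour-over coffee",
        [
            "Boil water; grind the beans",
            "Place the filter in the dripper, then add the grounds",
            "Pour water over the grounds",
        ],
    ),
    Procedure(
        "Replace a bike tube",
        ["Remove the wheel; pry off the tire", "Insert the new tube, then inflate it"],
    ),
]
captions = ["a kettle of boiling water", "a paper filter sitting in a dripper"]

print(next_step(captions, library))
```

In this toy run, refinement turns three coarse steps into five sub-steps, so the latest image aligns with "Place the filter in the dripper" rather than with the whole compound step. In a real system, a VLM or multimodal encoder would presumably replace each of the three helper functions.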