CV CL Feb 22

VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva, David Semedo, João Magalhães

arXiv:2602.19146v1 1.5 h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses the need for multimodal dialogue systems that provide grounded guidance for tasks like cooking and DIY, though it appears incremental, building on prior work in video and language integration.

The paper tackles understanding and reasoning over complex instructional videos in a dialogue setting. It introduces VIGiA, which outperforms state-of-the-art models on all evaluated tasks and reaches over 90% accuracy on plan-aware VQA.

We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
