CV · Aug 15, 2025

Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark

arXiv:2508.11192v1 · 1 citation · h-index: 7
Originality: Incremental advance
AI Analysis

This paper provides a dataset for AI agents that assist users in complex, multi-step tasks such as cooking and mechanics. The advance is incremental: it builds on existing video data and relies on automated generation.

The paper tackles the scarcity of dialogue-video datasets for real-world task assistance by proposing an automated method using large language models to convert single-person instructional videos into two-person dialogues, resulting in the HowToDIV dataset with 507 conversations, 6,636 question-answer pairs, and 24 hours of video clips across diverse tasks.

Many everyday tasks, ranging from fixing appliances and cooking recipes to car maintenance, require expert knowledge, especially when tasks are complex and multi-step. Despite growing interest in AI agents, there is a scarcity of dialogue-video datasets grounded in real-world task assistance. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into two-person task-guidance dialogues, aligned with fine-grained steps and video clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6,636 question-answer pairs, and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting. Each session includes a multi-turn conversation in which an expert teaches a novice user how to perform a task step by step, while observing the user's surroundings through a wearable device equipped with a camera and microphone. We establish baseline benchmark performance on the HowToDIV dataset using the Gemma-3 model, to support future research on this new task of dialogue for procedural-task assistance.

