CV · Feb 1

From Videos to Conversations: Egocentric Instructions for Task Assistance

arXiv:2602.01038v1
Originality: Incremental advance
AI Analysis

The paper addresses the shortage of training data for augmented reality assistance systems. The advance is incremental, since it builds on existing video-to-text methods.

The paper tackles the scarcity of large-scale multimodal conversational datasets for AI agents in task assistance by automatically transforming instructional videos into expert-novice conversations, resulting in the HowToDIV dataset with 507 conversations and 6,636 QA pairs.

Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question-answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.
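To make the reported dataset figures concrete, the following is a minimal sketch of how one multi-turn expert-novice session might be represented, plus the per-conversation averages implied by the abstract's numbers. The class names (`Turn`, `Session`), the role labels, and the `video_span` field are hypothetical illustrations and not the HowToDIV schema, which the abstract does not describe. Only the three totals come from the source, and the 24-hour figure is treated as approximate.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Turn:
    role: str  # "novice" or "expert" (hypothetical labels, not from the paper)
    text: str
    # Hypothetical (start_seconds, end_seconds) of the video segment a turn refers to.
    video_span: Optional[Tuple[float, float]] = None


@dataclass
class Session:
    task: str
    turns: List[Turn] = field(default_factory=list)

    def qa_pairs(self) -> List[Tuple[str, str]]:
        """Pair each novice turn with the expert turn that immediately follows it."""
        pairs = []
        for prev, cur in zip(self.turns, self.turns[1:]):
            if prev.role == "novice" and cur.role == "expert":
                pairs.append((prev.text, cur.text))
        return pairs


# Totals reported in the abstract.
NUM_CONVERSATIONS = 507
NUM_QA_PAIRS = 6636
VIDEO_HOURS = 24  # reported as a round figure, so derived values are approximate

# Derived averages (simple division, not figures stated by the authors).
avg_qa_per_conversation = NUM_QA_PAIRS / NUM_CONVERSATIONS
avg_minutes_per_conversation = VIDEO_HOURS * 60 / NUM_CONVERSATIONS

if __name__ == "__main__":
    print(f"QA pairs per conversation: {avg_qa_per_conversation:.2f}")
    print(f"Video minutes per conversation: {avg_minutes_per_conversation:.2f}")
```

Under these assumptions, the totals work out to roughly 13 question-answer pairs and a little under 3 minutes of video per conversation, which suggests short, densely questioned sessions.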

