CL, CV | Sep 3, 2025

ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

arXiv:2509.02949v1 | 4 citations | h-index: 4
Originality: Synthesis-oriented
AI Analysis

ProMQA-Assembly provides a new evaluation dataset for developing procedural-activity assistants for assembly tasks. The contribution is incremental, since it builds on existing multimodal QA approaches.

The authors address the lack of practical testbeds for evaluating assembly-task assistants by creating ProMQA-Assembly, a multimodal QA dataset of 391 QA pairs grounded in human-activity recordings and instruction manuals. They benchmark several models, including competitive proprietary multimodal models, and the results show substantial room for improvement.

Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.
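The abstract mentions instruction task graphs for toy-vehicle assembly but does not give their format. As a rough illustration only, the sketch below assumes a task graph is a directed acyclic graph in which each step lists the steps that must precede it. The step names, `TASK_GRAPH`, `next_valid_steps`, and `first_order_violation` are all hypothetical and are not taken from the paper or its released data.

```python
from typing import Dict, List, Set

# Hypothetical task graph (an assumption, not the paper's data):
# each step maps to the set of steps that must be completed before it.
TASK_GRAPH: Dict[str, Set[str]] = {
    "attach_chassis": set(),
    "attach_wheels": {"attach_chassis"},
    "attach_cabin": {"attach_chassis"},
    "attach_roof": {"attach_cabin"},
}


def next_valid_steps(graph: Dict[str, Set[str]], done: List[str]) -> List[str]:
    """Return the steps whose prerequisites are all done and that are not done yet."""
    completed = set(done)
    return sorted(
        step
        for step, prereqs in graph.items()
        if step not in completed and prereqs <= completed
    )


def first_order_violation(graph: Dict[str, Set[str]], observed: List[str]) -> int:
    """Return the index of the first observed step that is unknown or performed
    before its prerequisites, or -1 if the sequence respects the graph."""
    completed: Set[str] = set()
    for index, step in enumerate(observed):
        if step not in graph or not graph[step] <= completed:
            return index
        completed.add(step)
    return -1


if __name__ == "__main__":
    print(next_valid_steps(TASK_GRAPH, ["attach_chassis"]))
    print(first_order_violation(TASK_GRAPH, ["attach_chassis", "attach_roof"]))
```

Under this assumed representation, a question such as "what can I do next?" reduces to `next_valid_steps`, and a question about a mistake in a recording reduces to `first_order_violation`. This may be one way such graphs help a human verifier check a generated answer.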

