CL · Apr 19, 2023

MPMQA: Multimodal Question Answering on Product Manuals

arXiv:2304.09660v1 · 16 citations · h-index: 17 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for multimodal understanding of product manuals, which matters to both users and developers. It is incremental in that it extends existing PMQA with visual elements.

The authors tackle product manual question answering by introducing a multimodal task that requires processing multimodal content and generating answers with both textual and visual parts. They also construct PM209, a large-scale dataset with 22,021 annotated question-answer pairs from 209 manuals.

Visual content, such as illustrations and images, plays a major role in understanding product manuals. Existing Product Manual Question Answering (PMQA) datasets tend to ignore visual content and retain only the textual parts. In this work, to emphasize the importance of multimodal content, we propose a Multimodal Product Manual Question Answering (MPMQA) task. For each question, MPMQA requires the model not only to process multimodal content but also to provide a multimodal answer. To support MPMQA, we construct PM209, a large-scale dataset with human annotations that contains 209 product manuals from 27 well-known consumer electronics brands. The human annotations cover 6 types of semantic regions for manual content and 22,021 question-answer pairs. In particular, each answer consists of a textual sentence and the related visual regions from the manual. Because product manuals are long and a question is always related to a small number of pages, MPMQA splits naturally into two subtasks: retrieving the most related pages and then generating multimodal answers. We further propose a unified model that performs these two subtasks jointly and achieves performance comparable to that of multiple task-specific models. The PM209 dataset is available at https://github.com/AIM3-RUC/MPMQA.
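The two-subtask split described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Region`, `Page`, and `MultimodalAnswer` classes, the field names, and the toy manual are hypothetical and do not reflect the actual PM209 schema, and the word-overlap scoring is a simple stand-in for the learned unified model, not the authors' method.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A semantic region on a manual page (hypothetical schema, not PM209's)."""
    region_id: str
    kind: str   # illustrative label only; the six PM209 region types are not listed here
    text: str   # text carried by the region (for example a caption or paragraph)


@dataclass
class Page:
    page_id: int
    regions: list[Region] = field(default_factory=list)


@dataclass
class MultimodalAnswer:
    text: str               # the textual sentence of the answer
    region_ids: list[str]   # related visual regions from the manual


def _tokens(s: str) -> set[str]:
    """Lowercased word set with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in s.split()} - {""}


def _overlap(question: str, text: str) -> int:
    return len(_tokens(question) & _tokens(text))


def retrieve_pages(question: str, pages: list[Page], top_k: int = 1) -> list[Page]:
    """Subtask 1: rank pages by word overlap (stand-in for a learned retriever)."""
    def score(page: Page) -> int:
        return sum(_overlap(question, r.text) for r in page.regions)
    return sorted(pages, key=score, reverse=True)[:top_k]


def answer(question: str, pages: list[Page]) -> MultimodalAnswer:
    """Subtask 2: return a sentence plus related regions from the retrieved pages.

    A real system would generate the sentence; this toy version copies the
    text of the best-matching region.
    """
    regions = [r for p in pages for r in p.regions]
    best = max(regions, key=lambda r: _overlap(question, r.text))
    related = [r.region_id for r in regions if _overlap(question, r.text) > 0]
    return MultimodalAnswer(text=best.text, region_ids=related)


# Toy manual with two pages (invented content).
pages = [
    Page(1, [Region("p1-r1", "text", "Press the power button to turn on the camera.")]),
    Page(2, [
        Region("p2-r1", "text", "Insert the battery into the slot."),
        Region("p2-r2", "illustration", "Battery slot diagram"),
    ]),
]
question = "How do I insert the battery?"

top_pages = retrieve_pages(question, pages, top_k=1)
result = answer(question, top_pages)
print(result)
```

Restricting the answer stage to the retrieved pages mirrors the motivation given in the abstract: manuals are long, but each question concerns only a few pages.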

Code Implementations: 1 repo