CL · Oct 4, 2023

Multimodal Question Answering for Unified Information Extraction

Microsoft
arXiv:2310.03017v1 · 11 citations · h-index: 42
Originality: Highly original
AI Analysis

This work addresses the generalization and data-efficiency challenges of multimodal information extraction (MIE) in real-world applications, where tasks are diverse and labeled data is limited. It represents a novel method rather than an incremental improvement.

The paper tackles multimodal information extraction (MIE) by proposing a multimodal question answering (MQA) framework that unifies three MIE tasks into a single span extraction and multi-choice QA pipeline. The framework yields significant improvements over baselines: it outperforms the previous state of the art in zero-shot settings and enables models on the scale of 10B parameters to compete with much larger models such as GPT-4.

Multimodal information extraction (MIE) aims to extract structured information from unstructured multimedia content. Due to the diversity of tasks and settings, most current MIE models are task-specific and data-intensive, which limits their generalization to real-world scenarios with diverse task requirements and limited labeled data. To address these issues, we propose a novel multimodal question answering (MQA) framework that unifies three MIE tasks by reformulating them into a single span extraction and multi-choice QA pipeline. Extensive experiments on six datasets show that: 1) Our MQA framework consistently and significantly improves the performance of various off-the-shelf large multimodal models (LMMs) on MIE tasks, compared to vanilla prompting. 2) In the zero-shot setting, MQA outperforms previous state-of-the-art baselines by a large margin. In addition, the effectiveness of our framework transfers to the few-shot setting, enabling LMMs on the scale of 10B parameters to be competitive with, or outperform, much larger language models such as ChatGPT and GPT-4. Our MQA framework can serve as a general principle for utilizing LMMs to better solve MIE and potentially other downstream multimodal tasks.
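To make the reformulation concrete, the following is a minimal sketch of how an entity-typing task might be cast as a two-stage QA pipeline: a span extraction question followed by a multi-choice question per extracted span. It is not the authors' implementation. The prompt wording, the function names (`span_extraction_prompt`, `multi_choice_question`, `parse_choice`), the example sentence, and the label set are all assumptions chosen for illustration, and the call to an actual LMM (with the image attached) is left out.

```python
import re
from dataclasses import dataclass
from string import ascii_uppercase
from typing import Dict, List, Optional


@dataclass
class MultiChoiceQuestion:
    prompt: str
    options: Dict[str, str]  # option letter -> label


def span_extraction_prompt(sentence: str) -> str:
    """Stage 1 (hypothetical wording): ask the model for candidate entity spans."""
    return (
        "Given the image and the sentence below, list every named entity "
        "span exactly as it appears in the sentence.\n"
        f"Sentence: {sentence}\n"
        "Spans:"
    )


def multi_choice_question(sentence: str, span: str, labels: List[str]) -> MultiChoiceQuestion:
    """Stage 2 (hypothetical wording): classify one span via lettered options."""
    options = dict(zip(ascii_uppercase, labels))
    option_lines = "\n".join(f"({letter}) {label}" for letter, label in options.items())
    prompt = (
        f"Sentence: {sentence}\n"
        f'Question: Based on the image and the sentence, what type of entity is "{span}"?\n'
        f"{option_lines}\n"
        "Answer with the letter of the correct option."
    )
    return MultiChoiceQuestion(prompt=prompt, options=options)


def parse_choice(answer: str, question: MultiChoiceQuestion) -> Optional[str]:
    """Map a raw model answer such as "(C)" or "C" back to its label.

    Returns None when no valid option letter is found.
    """
    match = re.search(r"\(([A-Z])\)|^\s*([A-Z])\b", answer)
    if match is None:
        return None
    letter = match.group(1) or match.group(2)
    return question.options.get(letter)


if __name__ == "__main__":
    sentence = "Kevin Durant joins the Warriors"  # illustrative example, not from the paper
    labels = ["person", "location", "organization", "miscellaneous"]  # assumed label set
    print(span_extraction_prompt(sentence))
    question = multi_choice_question(sentence, "Warriors", labels)
    print(question.prompt)
    print(parse_choice("(C) organization", question))
```

Constraining the second stage to lettered options keeps the model's output easy to parse and limits it to the task's label set, which is one plausible reason a multi-choice formulation could help over free-form prompting.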

Code Implementations: 1 repo
