CV, CL | Feb 16, 2024

II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

arXiv:2402.11058v3 | 30 citations | h-index: 22 | ACL
Originality: Incremental advance
AI Analysis

This work addresses the challenge of complex reasoning in VQA for AI systems, though it is incremental as it builds on existing prompting techniques.

The paper evaluates and improves multi-modal multi-hop reasoning in Visual Question Answering (VQA). It finds that most questions in benchmarks such as GQA and A-OKVQA need only single-hop reasoning, and it reports that II-MMR is effective across all reasoning cases, including the multi-hop questions where traditional CoT prompting struggles.

Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have focused only on assessing the model's overall accuracy without evaluating it on different reasoning cases. Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially for complex scenarios requiring multi-hop reasoning. In this paper, we propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in VQA. Specifically, II-MMR takes a VQA question with an image and finds a reasoning path to reach its answer using one of two novel language prompts: (i) an answer prediction-guided CoT prompt, or (ii) a knowledge triplet-guided prompt. II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and what types (i.e., visual or beyond-visual) of reasoning are required to answer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR observes that most of their VQA questions are easy to answer, demanding only "single-hop" reasoning, whereas only a few questions require "multi-hop" reasoning. Moreover, while a recent V&L model struggles with such complex multi-hop reasoning questions even when using the traditional CoT method, II-MMR shows its effectiveness across all reasoning cases in both zero-shot and fine-tuning settings.
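
The path-analysis step described in the abstract (counting hops and labeling the reasoning type) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Triplet` class, its `visual` flag, and the `analyze_path` function are hypothetical names introduced here, and the reasoning path is written by hand rather than produced by prompting a V&L model. It assumes a path is a list of knowledge triplets, that each triplet counts as one hop, and that a path is "beyond-visual" as soon as any hop relies on knowledge not visible in the image.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Triplet:
    """One hop of a reasoning path (hypothetical structure, not from the paper)."""
    subject: str
    relation: str
    obj: str
    visual: bool  # assumed label: True if the fact can be read off the image


def analyze_path(path: List[Triplet]) -> Dict[str, Union[int, str]]:
    """Estimate hop count and reasoning type for a hand-written reasoning path."""
    if not path:
        raise ValueError("a reasoning path needs at least one triplet")
    hops = len(path)  # assumption: one triplet corresponds to one hop
    return {
        "hops": hops,
        "case": "single-hop" if hops == 1 else "multi-hop",
        # assumption: any non-visual hop makes the whole path beyond-visual
        "type": "visual" if all(t.visual for t in path) else "beyond-visual",
    }


# Illustrative question: "What sport is the man about to play?"
path = [
    Triplet("man", "holding", "racket", visual=True),
    Triplet("racket", "used for", "tennis", visual=False),
]
print(analyze_path(path))
```

Under these assumptions the example path is classified as a two-hop, beyond-visual case, while its first triplet alone would be a single-hop, visual case.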

Code Implementations: 1 repo