CV · Dec 19, 2024

Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

arXiv:2412.14880v1 · h-index: 12
Originality: Highly original
AI Analysis

This paper addresses a specific bottleneck in multimodal AI systems, cascading retrieval errors in multi-image question answering, that matters to researchers and practitioners, though the contribution is incremental in nature.

The paper tackles cascading errors in retrieval-based multi-image question answering by proposing a method that uses multimodal hypothetical summaries to turn image retrieval into text-to-text retrieval, achieving a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP.

The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing these images to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the QA training objective does not optimize the retrieval stage. To address this issue, we propose a novel method to effectively introduce and reference retrieved information in QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and replaces real images in retrieval, which eliminates the modality gap by transforming the task into text-to-text retrieval and helps improve retrieval. To better integrate retrieval with QA, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy that calculates both sentence-level and word-level similarity scores to further enhance retrieval and filter out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.
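
The coarse-to-fine scoring mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring formulas, so the sketch assumes mean-pooled word vectors for the sentence-level pass and a max-over-words match for the word-level pass. The function names (`cosine`, `mean_pool`, `word_level_score`, `coarse_to_fine`) and the toy 3-dimensional vectors are hypothetical; in practice the vectors would come from a trained text encoder applied to the question and to each image's hypothetical summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(word_vecs):
    """Assumed sentence embedding: the average of the word vectors."""
    n = len(word_vecs)
    return [sum(col) / n for col in zip(*word_vecs)]


def word_level_score(query_words, summary_words):
    """Assumed fine score: each query word takes its best-matching summary word; average the maxima."""
    best = [max(cosine(q, s) for s in summary_words) for q in query_words]
    return sum(best) / len(best)


def coarse_to_fine(query_words, summaries, top_k=2):
    """Shortlist top_k summaries by sentence-level similarity, then re-rank them by word-level similarity."""
    q_sent = mean_pool(query_words)
    coarse = sorted(
        summaries,
        key=lambda name: cosine(q_sent, mean_pool(summaries[name])),
        reverse=True,
    )[:top_k]
    fine = {name: word_level_score(query_words, summaries[name]) for name in coarse}
    return sorted(fine, key=fine.get, reverse=True)


# Toy data: each image is represented only by the word vectors of its text summary.
query_words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
summaries = {
    "img_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # both query words plus an extra detail
    "img_b": [[0.7, 0.7, 0.0]],  # close at sentence level, but no exact word match
    "img_c": [[0.0, 0.0, 1.0]],  # unrelated to the query
}

ranking = coarse_to_fine(query_words, summaries, top_k=2)
print(ranking)
```

In this toy example the sentence-level pass drops `img_c` and ranks `img_b` first, while the word-level pass promotes `img_a`, whose extra, irrelevant word diluted its pooled embedding but does not hurt its per-word matches.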

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
