IR | AI | CL | Feb 1

MARA: A Multimodal Adaptive Retrieval-Augmented Framework for Document Question Answering

arXiv:2604.16313 | 2 citations | h-index: 19
Originality: Highly original
AI Analysis

This work addresses limitations in multimodal document QA, with applications such as document analysis; it is incremental in that it builds on existing retrieval-augmented generation approaches.

The paper tackles retrieval-based multimodal document question answering by proposing the MARA framework, which introduces query-adaptive mechanisms for both retrieval and generation and reports consistent improvements in retrieval relevance and answer quality over existing state-of-the-art methods across six benchmarks.

Retrieval-based multimodal document QA aims to identify and integrate relevant information from visually rich documents with complex multimodal structures. While retrieval-augmented generation (RAG) has shown strong performance in text-based QA, its extensions to multimodal documents remain underexplored and face significant limitations. Specifically, current approaches rely on query-agnostic document representations that overlook salient content, and they use static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. To address these limitations, we propose the Multimodal Adaptive Retrieval-Augmented (MARA) framework, which introduces query-adaptive mechanisms into both retrieval and generation. MARA consists of two components: a Query-Aligned Region Encoder that builds multi-level document representations and reweights them based on query relevance to improve retrieval precision; and a Self-Reflective Evidence Controller that monitors evidence sufficiency during generation and adaptively incorporates content from lower-ranked sources using a sliding-window strategy. Experiments on six multimodal QA benchmarks demonstrate that MARA consistently improves retrieval relevance and answer quality over existing SOTA methods.
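
The two ideas in the abstract, query-based reweighting of region representations and sliding-window evidence expansion, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the cosine-plus-softmax weighting, the function names `reweight_regions` and `expand_evidence`, the `temperature` and `window` parameters, and the caller-supplied `is_sufficient` check are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def reweight_regions(
    query: Sequence[float],
    regions: Sequence[Sequence[float]],
    temperature: float = 1.0,  # hypothetical knob, not from the paper
) -> Tuple[List[float], List[float]]:
    """Weight region vectors by softmax over query similarity, then pool.

    Assumed stand-in for "reweights them based on query relevance".
    """
    if not regions:
        raise ValueError("at least one region vector is required")
    sims = [cosine(query, r) / temperature for r in regions]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * r[i] for w, r in zip(weights, regions))
        for i in range(len(query))
    ]
    return weights, pooled


def expand_evidence(
    ranked: Sequence[str],
    is_sufficient: Callable[[List[str]], bool],
    window: int = 2,  # hypothetical window size
) -> List[str]:
    """Add ranked evidence one window at a time until the check passes.

    Assumed stand-in for the sliding-window strategy: instead of a fixed
    top-k, lower-ranked items are pulled in only while the caller's
    sufficiency check still fails.
    """
    used: List[str] = []
    start = 0
    while start < len(ranked):
        used.extend(ranked[start:start + window])
        start += window
        if is_sufficient(used):
            break
    return used
```

In this sketch, `is_sufficient` stands in for whatever signal the generator uses to judge its evidence. Here it is an arbitrary callable, so the loop falls back to using every ranked item if the check never passes.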

