CV · AI · Oct 12, 2025

Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs

Suyang Xi, Chenxi Yang, Hong Ding, Yiqing Ni, Catherine C. Liu, Yunhao Liu, Chengqi Zhang

arXiv:2510.10426v1 · 2 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the issue of unreliable reasoning in MLLMs for tasks requiring detailed visual understanding, representing an incremental improvement over existing retrieval-augmented generation methods.

The paper tackles fine-grained visual question answering in multimodal large language models (MLLMs), which often hallucinate because of poor visual grounding, and introduces the HuLiRAG framework to improve grounding fidelity and reduce hallucinations.

Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information and lacks local detail, which limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a "what–where–reweight" cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
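
The "reweight" stage described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the local/global trade-off is computed, so the fixed mixing weight `alpha`, the `Candidate` class, the `reweight` function, and the toy embeddings are all hypothetical names and assumptions introduced here. The sketch only shows the general idea of ranking retrieved images by a blend of whole-image similarity and masked-region similarity.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Candidate:
    """A retrieved image with two embeddings (both hypothetical inputs)."""
    name: str
    global_emb: List[float]  # embedding of the whole image
    local_emb: List[float]   # embedding of the mask-selected region


def reweight(query_emb: Sequence[float],
             candidates: List[Candidate],
             alpha: float = 0.5) -> List[str]:
    """Rank candidates by alpha * local + (1 - alpha) * global similarity.

    ASSUMPTION: a fixed scalar alpha stands in for the adaptive
    local/global trade-off mentioned in the abstract.
    """
    scored = []
    for cand in candidates:
        global_sim = cosine(query_emb, cand.global_emb)
        local_sim = cosine(query_emb, cand.local_emb)
        scored.append((alpha * local_sim + (1 - alpha) * global_sim, cand.name))
    # Highest blended score first.
    return [name for _, name in sorted(scored, key=lambda t: t[0], reverse=True)]


# Toy data: A matches the query globally, B matches it only in the masked region.
query = [1.0, 0.0]
candidates = [
    Candidate("A", global_emb=[1.0, 0.0], local_emb=[0.0, 1.0]),
    Candidate("B", global_emb=[0.0, 1.0], local_emb=[1.0, 0.0]),
]
print(reweight(query, candidates, alpha=0.8))  # local evidence dominates
print(reweight(query, candidates, alpha=0.2))  # global evidence dominates
```

With a high `alpha` the candidate whose masked region matches the query is ranked first; with a low `alpha` the candidate with the better whole-image match wins. An adaptive scheme would presumably choose this weight per query rather than fixing it.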
