AI · Sep 30, 2025

GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination

arXiv:2509.25669v1
Originality: Incremental advance
AI Analysis

This paper addresses hallucination and accuracy issues in vision-language models for VQA tasks. It is an incremental improvement with measured gains on both.

The paper aims to improve Visual Question Answering (VQA) by introducing text-grounded object localization and a de-hallucination method, raising accuracy from 22.19% to 25.64% and cutting the hallucination rate from 65.79% to 13.88%.
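To put the reported figures in perspective, the short sketch below computes the absolute change (in percentage points) and the relative change for both metrics. The helper names are hypothetical and are not from the paper; only the four input percentages come from the summary above.

```python
def absolute_change(before: float, after: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(after - before, 2)


def relative_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return (after - before) / before * 100


# Figures reported in the summary (percent).
accuracy_before, accuracy_after = 22.19, 25.64
halluc_before, halluc_after = 65.79, 13.88

print("accuracy:", absolute_change(accuracy_before, accuracy_after), "pp,",
      f"{relative_change(accuracy_before, accuracy_after):.1f}% relative")
print("hallucination:", absolute_change(halluc_before, halluc_after), "pp,",
      f"{relative_change(halluc_before, halluc_after):.1f}% relative")
```

The accuracy gain is 3.45 percentage points (roughly a 15.5% relative improvement), while the hallucination rate falls by 51.91 percentage points (roughly a 78.9% relative reduction).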

We propose a method to improve Visual Question Answering (VQA) with Retrieval-Augmented Generation (RAG) by introducing text-grounded object localization. Rather than retrieving information based on the entire image, our approach enables the model to generate a bounding box around the object most relevant to the question, allowing for targeted image cropping and focused retrieval. This reduces background noise, improves alignment between visual and textual cues, and helps mitigate hallucinations. Our RAG method enhances context-aware VQA responses, increasing accuracy from 22.19% to 25.64% (an absolute gain of 3.45 percentage points) over the baseline Llama-3.2-Vision-11B agent. We also propose a de-hallucination method based on question type, which reduces the hallucination rate from 65.79% to 13.88% and improves the truthfulness score.
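The cropping step described in the abstract can be illustrated with a minimal sketch. It assumes the model returns a pixel-coordinate box as (left, top, right, bottom) with exclusive right and bottom edges; the box format, the optional padding, and the fall-back to the full image for a degenerate box are all assumptions made for illustration, not details from the paper. A nested list stands in for a real image array to keep the example dependency-free.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixel values (stand-in for an image array)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def clamp_box(box: Box, width: int, height: int, pad: int = 0) -> Box:
    """Expand the box by `pad` pixels and clip it to the image bounds."""
    left, top, right, bottom = box
    left = max(0, min(left - pad, width))
    top = max(0, min(top - pad, height))
    right = max(0, min(right + pad, width))
    bottom = max(0, min(bottom + pad, height))
    if right <= left or bottom <= top:
        # Assumed behaviour: a degenerate box falls back to the whole image.
        return (0, 0, width, height)
    return (left, top, right, bottom)


def crop(image: Image, box: Box, pad: int = 0) -> Image:
    """Return the sub-image covered by the (clamped) bounding box."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = clamp_box(box, width, height, pad)
    return [row[left:right] for row in image[top:bottom]]


# A 4x4 toy image whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
region = crop(image, (1, 1, 3, 3))
print(region)
```

In a full pipeline the cropped region, rather than the whole image, would be passed to the retrieval step, which is what keeps background content out of the retrieved context.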

