CV · AI · Mar 17

Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Princeton
arXiv:2603.16664 · 98.3 · h-index: 12
AI Analysis

Kestrel addresses hallucination, a critical deployment issue for LVLMs in multimodal tasks, offering a training-free and therefore low-cost solution with interpretability gains.

The paper tackles hallucinations in large vision-language models (LVLMs) by proposing Kestrel, a training-free framework that combines visual grounding with evidence-verified self-refinement. With Qwen3-VL, it reports average improvements of +3.31% on POPE and +28.34 on MME-Hallucination.

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem. Yet existing approaches based on decoding or tool use often bring limited gains, weak interpretability, or both. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable, structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it with an LVLM judge, then iteratively self-refines its answers based only on the verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL). It also provides transparent verification traces for hallucination diagnosis and analysis; for example, the integrated self-refinement module and the grounding agent both contribute an average +2.0% gain on POPE.
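
The collect, verify, refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Evidence` schema, the `self_refine` function, the stopping rule, and the three toy callables are all hypothetical stand-ins for the grounding agent, the LVLM judge, and the answer reviser, none of which are specified on this page.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Evidence:
    """One piece of textual evidence derived from a tool output (hypothetical schema)."""
    claim: str
    source: str
    verified: bool = False


# Hypothetical callable types standing in for the grounding agent, judge, and reviser.
Collector = Callable[[str, str], List[Evidence]]
Judge = Callable[[str, Evidence], bool]
Reviser = Callable[[str, str, List[Evidence]], str]
Trace = List[Tuple[int, str, List[Evidence]]]


def self_refine(question: str, draft: str, collect: Collector, judge: Judge,
                revise: Reviser, max_rounds: int = 3) -> Tuple[str, Trace]:
    """Refine `draft` using only judge-approved evidence; return the answer and a trace."""
    answer = draft
    trace: Trace = []
    for round_idx in range(max_rounds):
        candidates = collect(question, answer)
        # Keep only evidence the judge accepts, and mark it as verified.
        verified = [replace(e, verified=True) for e in candidates if judge(question, e)]
        trace.append((round_idx, answer, verified))
        if not verified:
            break  # no verified evidence: leave the answer alone (avoids over-correction)
        revised = revise(question, answer, verified)
        if revised == answer:
            break  # answer is stable, stop iterating
        answer = revised
    return answer, trace


# Toy stand-ins (assumptions for illustration only; no real model or tool is called).
def toy_collect(question: str, answer: str) -> List[Evidence]:
    return [
        Evidence("detector: 0 boxes for 'dog'", source="detector"),
        Evidence("caption: maybe a dog", source="captioner"),
    ]


def toy_judge(question: str, evidence: Evidence) -> bool:
    return evidence.source == "detector"


def toy_revise(question: str, answer: str, verified: List[Evidence]) -> str:
    return "no" if any("0 boxes" in e.claim for e in verified) else answer


final_answer, trace = self_refine(
    "Is there a dog in the image?", "yes", toy_collect, toy_judge, toy_revise
)
print(final_answer)
for round_idx, answer_before, evidence in trace:
    print(round_idx, answer_before, [e.claim for e in evidence])
```

The returned `trace` is a rough analogue of the verification traces the abstract mentions: each entry records the answer going into a round and the evidence that passed the judge, so a changed answer can be traced back to the specific evidence behind it.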

