AI · CV · Aug 1, 2025

CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding

arXiv:2508.00378v3
h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses unreliable reasoning in vision-language models, which matters for applications that require trustworthy multimodal AI. It is an incremental advance because it builds on existing chain-of-thought methods.

The paper tackles the problem of hallucinations in multimodal reasoning with vision-language models by introducing CoRGI, a framework that verifies chain-of-thought outputs through post-hoc visual grounding, resulting in improved answer accuracy and explanation faithfulness across multiple benchmarks and backbones.

Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmarks (VCR, ScienceQA, MMMU, MathVista, and HallusionBench) demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond quantitative gains, qualitative analyses further illustrate how the verification process reduces hallucination and strengthens interpretability, suggesting that post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems.
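
The decompose, ground, and filter steps described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the sentence splitter, the `verifier` callable, the `threshold` value, and the keyword-matching `toy_verifier` are all hypothetical stand-ins, and a real system would presumably query a VLM against the image at the grounding step.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VerifiedStep:
    text: str        # one step-wise statement from the rationale
    score: float     # support score returned by the verifier (assumed in [0, 1])
    supported: bool  # True when score >= threshold


def split_into_steps(rationale: str) -> List[str]:
    """Naive sentence split; a stand-in for the decomposition step."""
    parts = re.split(r"(?<=[.!?])\s+", rationale.strip())
    return [p for p in parts if p]


def verify_rationale(
    rationale: str,
    verifier: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[VerifiedStep]:
    """Score every step with the verifier and flag unsupported ones."""
    steps = []
    for text in split_into_steps(rationale):
        score = verifier(text)
        steps.append(VerifiedStep(text, score, score >= threshold))
    return steps


def keep_supported(steps: List[VerifiedStep]) -> List[str]:
    """Filter step: drop statements the verifier could not ground."""
    return [s.text for s in steps if s.supported]


def toy_verifier(step: str) -> float:
    """Hypothetical grounding check: pretend the image contains only these objects."""
    visible = {"dog", "ball"}
    words = set(re.findall(r"[a-z]+", step.lower()))
    return 1.0 if words & visible else 0.0
```

Calling `verify_rationale("The dog is chasing a ball. A cat is watching from the roof.", toy_verifier)` and then `keep_supported` on the result would retain only the first statement, since the toy verifier finds no visible object in the second. The sketch covers filtering only; the correction of unsupported claims mentioned in the abstract is left out.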

