CV, AI, CL, LG | Jun 26, 2025

HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation

arXiv:2506.21546v2 | 4 citations | h-index: 18
Originality: Incremental advance
AI Analysis

The paper addresses the need for better evaluation of grounding fidelity in segmentation models, which is crucial for reliable visual understanding in AI applications. The advance is incremental, since it builds on existing hallucination evaluation efforts.

The paper tackles hallucinations in vision-language segmentation models, where a model produces segmentation masks for objects that are not present in the image. It introduces HalluSegBench, a benchmark of 1340 counterfactual instance pairs with new metrics, and finds that vision-driven hallucinations are more prevalent than label-driven ones.

Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.
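The idea of probing a model with a counterfactual pair can be illustrated with a minimal sketch. This is not the paper's metric: the abstract does not define the benchmark's metrics, so the function names `iou` and `hallucination_drop`, the list-of-lists mask format, and the scoring rule are all assumptions made for illustration. The sketch queries the same object class on a factual image and on an edited image where that object has been replaced, then compares how much of the original object region the model still segments.

```python
from typing import List

# Hypothetical mask format: a binary grid where 1 marks an object pixel.
Mask = List[List[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return inter / union if union else 1.0


def hallucination_drop(pred_factual: Mask,
                       pred_counterfactual: Mask,
                       object_region: Mask) -> float:
    """Illustrative score (an assumption, not the paper's definition).

    Compares IoU with the original object's region before and after the
    object is edited out of the image. A grounded model stops segmenting
    the region once the object is gone, giving a large drop. A model that
    persists in segmenting the absent object gives a drop near zero.
    """
    before = iou(pred_factual, object_region)
    after = iou(pred_counterfactual, object_region)
    return before - after


if __name__ == "__main__":
    region = [[1, 1], [0, 0]]   # where the object sits in the factual image
    empty = [[0, 0], [0, 0]]    # correct prediction once the object is removed

    print(hallucination_drop(region, empty, region))   # grounded model
    print(hallucination_drop(region, region, region))  # hallucinating model
```

Under this assumed scoring rule, a small drop signals the persistent false segmentation the abstract describes, while a large drop indicates the prediction tracks the visual content rather than the prompt alone.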

