CV · Jun 10, 2025

SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

arXiv:2506.08391v1 · 19 citations · h-index: 1 · Has Code · ICML
Originality: Highly original
AI Analysis

The paper addresses object hallucination, a known bottleneck for accurate visual understanding in VLMs, with a novel decoding method.

It proposes SECOND, a selective and contrastive decoding method that leverages multi-scale visual information. The authors report significant reductions in perceptual hallucination and better results than existing methods across benchmarks.

Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information across scales, SECOND significantly reduces perceptual hallucinations and outperforms existing methods across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
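To make the idea of contrasting visual information at decoding time concrete, here is a minimal sketch of generic contrastive decoding over next-token logits. It is not the SECOND algorithm itself: the abstract does not give its formulas, so the function names, the two-scale setup (`fine_logits`, `coarse_logits`), and the `alpha` and `beta` parameters are all illustrative assumptions borrowed from the common contrastive decoding recipe.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(fine_logits, coarse_logits, alpha=1.0, beta=0.1):
    """Score tokens by amplifying what the fine-scale view adds.

    Assumption: `fine_logits` come from a pass with richer visual input,
    `coarse_logits` from a pass with less visual detail. Both are hypothetical
    inputs for illustration, not outputs of the paper's pipeline.
    """
    fine = log_softmax(fine_logits)
    coarse = log_softmax(coarse_logits)
    # Plausibility cutoff: keep only tokens whose fine-scale probability
    # is at least beta times the probability of the most likely token.
    cutoff = math.log(beta) + max(fine)
    scores = []
    for f, c in zip(fine, coarse):
        if f < cutoff:
            scores.append(float("-inf"))
        else:
            scores.append((1 + alpha) * f - alpha * c)
    return scores


def pick_token(fine_logits, coarse_logits, alpha=1.0, beta=0.1):
    """Greedy choice under the contrastive scores."""
    scores = contrastive_scores(fine_logits, coarse_logits, alpha, beta)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens. The coarse view strongly prefers token 0,
# so the contrast shifts the choice to token 1.
fine = [2.0, 1.9, -3.0]
coarse = [3.0, 0.5, 0.0]
chosen = pick_token(fine, coarse)
```

With `alpha=0.0` the scores reduce to the fine-scale log-probabilities, so the contrast can be switched off for comparison. The cutoff keeps the subtraction from promoting tokens that the fine-scale pass already considers implausible.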

Code Implementations: 1 repo