CV | CL | Feb 12

Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

arXiv:2602.11737v1 | 2 citations
Originality: Incremental advance
AI Analysis

The paper addresses a reliability issue in MLLMs for applications such as image captioning and visual QA, but it is incremental because it builds on existing VCD methods.

The paper tackles object hallucination in Multimodal Large Language Models by improving visual contrastive decoding with an object-aligned auxiliary view, achieving consistent gains on benchmarks across two models.

We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computational overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
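To make the abstract concrete, here is a minimal sketch of the two steps it describes: masking the most salient patches to build an auxiliary view, then contrasting the logits from the two views. It is an illustration under assumptions, not the authors' implementation. The function names, the patch size, the masking ratio, and the fill value are hypothetical. The contrast rule and plausibility cutoff follow the commonly cited VCD formulation, with `alpha` and `beta` as assumed hyperparameters. The saliency map `attn` stands in for attention from a self-supervised Vision Transformer, which is not computed here.

```python
import numpy as np

def mask_salient_patches(image, attn, patch=16, ratio=0.3, fill=0.0):
    """Blank out the top-`ratio` most salient patches (hypothetical helper).

    image: (H, W, C) float array.
    attn:  (H // patch, W // patch) saliency scores, one per patch; assumed to
           come from a self-supervised ViT attention map.
    """
    rows, cols = attn.shape
    k = max(1, int(round(ratio * rows * cols)))
    top = np.argsort(attn.ravel())[-k:]  # indices of the k highest scores
    out = image.copy()
    for idx in top:
        r, c = divmod(int(idx), cols)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = fill
    return out

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contrastive_logits(logits_orig, logits_aux, alpha=1.0, beta=0.1):
    """Contrast next-token logits from the original and auxiliary views.

    Assumed VCD-style rule: (1 + alpha) * orig - alpha * aux, restricted to
    tokens whose original-view probability is at least beta * max probability.
    """
    scores = (1.0 + alpha) * logits_orig - alpha * logits_aux
    p_orig = softmax(logits_orig)
    keep = p_orig >= beta * p_orig.max()
    return np.where(keep, scores, -np.inf)
```

In use, the MLLM would be run once on the original image and once on the output of `mask_salient_patches`, and the next token would be chosen from `contrastive_logits` of the two logit vectors. Because the masked view depends only on the image and not on the prompt, it can be computed once and cached, which is consistent with the abstract's claim of a single cacheable forward pass.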

