CV, AI, LG | Aug 25, 2024

ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models

arXiv:2408.13906v1 | 19 citations | h-index: 6
Originality: Incremental advance
AI Analysis

ConVis addresses reliability issues in MLLMs for applications such as image captioning and visual QA. The advance is incremental: it combines existing contrastive decoding and T2I methods and involves no new training.

The paper tackles hallucinations in Multimodal Large Language Models (MLLMs), where responses inaccurately reflect the input image. It introduces ConVis, a training-free contrastive decoding method that uses text-to-image generation to reconstruct the image from hallucinated captions, then compares the probability distributions produced for the original and reconstructed images. The authors report reduced hallucinations across five benchmarks.

Hallucinations in Multimodal Large Language Models (MLLMs), where generated responses fail to accurately reflect the given image, pose a significant challenge to their reliability. To address this, we introduce ConVis, a novel training-free contrastive decoding method. ConVis leverages a text-to-image (T2I) generation model to semantically reconstruct the given image from hallucinated captions. By comparing the contrasting probability distributions produced by the original and reconstructed images, ConVis enables MLLMs to capture visual contrastive signals that penalize hallucination generation. Notably, this method operates purely within the decoding process, eliminating the need for additional data or model updates. Our extensive experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs, highlighting its potential to enhance model reliability.
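
To make the decoding-time comparison concrete, here is a minimal sketch of one contrastive step on a toy three-token vocabulary. It assumes the generic contrastive-decoding combination (1 + alpha) * original - alpha * reconstructed; the abstract does not state ConVis's exact formula, so that combination, the `alpha` weight, the function names, and all logit values are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_original, logits_reconstructed, alpha=1.0):
    """Boost tokens favoured by the original image and penalize tokens
    favoured by the reconstructed image.

    ASSUMPTION: this is the generic contrastive-decoding form, not a
    formula confirmed by the ConVis abstract.
    """
    if len(logits_original) != len(logits_reconstructed):
        raise ValueError("both logit vectors must cover the same vocabulary")
    return [
        (1 + alpha) * o - alpha * r
        for o, r in zip(logits_original, logits_reconstructed)
    ]


def pick_token(logits_original, logits_reconstructed, alpha=1.0):
    """Greedy choice over the contrast-adjusted distribution."""
    probs = softmax(contrastive_logits(logits_original, logits_reconstructed, alpha))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy values (hypothetical). The hallucinated caption mentioned a frisbee,
# so the T2I reconstruction contains one and the logits conditioned on it
# favour "frisbee" more strongly than the logits for the real image do.
vocab = ["dog", "frisbee", "grass"]
logits_original = [2.0, 2.1, 1.0]       # MLLM conditioned on the real image
logits_reconstructed = [1.0, 2.6, 1.0]  # MLLM conditioned on the T2I image

plain_choice = max(range(len(vocab)), key=logits_original.__getitem__)
contrast_choice, contrast_probs = pick_token(logits_original, logits_reconstructed)

print("without contrast:", vocab[plain_choice])
print("with contrast:   ", vocab[contrast_choice])
```

In this toy setup, plain greedy decoding picks "frisbee", while the contrast-adjusted step picks "dog", because the token that the reconstructed image reinforces is pushed down. Everything happens at decoding time, which matches the abstract's claim that no additional data or model updates are needed.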

Code Implementations: 1 repo
