CV | Nov 24, 2024

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

arXiv:2411.15839v2 | 14 citations | h-index: 6 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a critical reliability issue for users of LVLMs in multimodal applications, though it is an incremental improvement over existing training-free methods.

The paper tackles hallucination in Large Vision-Language Models (LVLMs), where models generate plausible but inaccurate responses. It proposes VaLiD, a method that corrects distortions in visual encoding through visual layer fusion and contrastive decoding, and reports state-of-the-art performance against baseline methods on various benchmarks.

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning. However, they often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods to mitigate hallucinations by adjusting the decoding strategy during the inference stage, typically attributing hallucinations to the language model itself. Our analysis, however, reveals that distortions in the visual encoding process significantly affect the model's reasoning capabilities. Specifically, earlier visual layers may retain key features but gradually distort as the information propagates toward the output layer. Building on these insights, we propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD). This method utilizes uncertainty to guide the visual layer selection, correcting distortions in the visual encoding process and thereby enhancing the reliability of the generated content. Experimental results demonstrate the effectiveness of VaLiD in mitigating hallucinations across various benchmarks, achieving state-of-the-art performance when compared to baseline methods. Codes are available at GitHub: https://github.com/RicardoLuL/VaLiD_LVLMs_hallucinations
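
The abstract names two ingredients: uncertainty-guided layer selection and contrastive decoding. The sketch below illustrates only the generic form of each and is not taken from the paper or its repository. It assumes entropy of the next-token distribution as the uncertainty measure, and the common contrastive form (1 + alpha) * primary - alpha * contrast. The function names, the `alpha` parameter, and the choice of which logits play the "primary" and "contrast" roles are all hypothetical; how VaLiD actually fuses visual layers is specified in the paper, not here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(layer_logits):
    """Return the index of the candidate whose next-token distribution
    has the highest entropy (assumed uncertainty proxy)."""
    entropies = [entropy(softmax(logits)) for logits in layer_logits]
    return max(range(len(entropies)), key=entropies.__getitem__)


def contrastive_decode(primary_logits, contrast_logits, alpha=1.0):
    """Generic contrastive step: amplify the primary logits and subtract
    the contrast logits, then renormalize into a distribution."""
    adjusted = [
        (1.0 + alpha) * p - alpha * c
        for p, c in zip(primary_logits, contrast_logits)
    ]
    return softmax(adjusted)


# Toy next-token logits over a 3-token vocabulary, one list per visual layer.
candidates = [[5.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
chosen = select_most_uncertain(candidates)
probs = contrastive_decode([2.0, 1.0, 0.0], candidates[chosen], alpha=1.0)
```

With `alpha=0` the contrast term vanishes and the result is the ordinary softmax of the primary logits, which makes the sketch easy to sanity-check.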

Code Implementations: 1 repo
