Context-Aware Decoding for Faithful Vision-Language Generation
This work addresses hallucinations in vision-language models, a critical limitation for applications such as image captioning and visual reasoning. The contribution is incremental, however, because it builds on existing mechanistic insights.
The paper tackles hallucinations in large vision-language models by analyzing layer-wise generation dynamics and introducing Context Embedding Injection (CEI), a training-free method. CEI reduces hallucination rates on the CHAIR, AMBER, and MMHal-Bench benchmarks, achieving the lowest overall rates among the compared state-of-the-art baselines.
Hallucinations, i.e., generated responses that are inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
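To make the commitment-depth idea concrete, the following is a minimal NumPy sketch of a Logit Lens style analysis on toy data. It is an illustration under stated assumptions, not the paper's implementation: the function names (`logit_lens`, `commitment_depth`), the 0.5 probability threshold, and the "first layer to cross the threshold" definition of commitment depth are all hypothetical choices made here. The final layer normalization that a real Logit Lens usually applies before unembedding is also omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(hidden_states, unembed):
    """Project each layer's hidden state through the unembedding matrix.

    hidden_states: (num_layers, d_model) states at one token position.
    unembed:       (d_model, vocab_size) output projection.
    Returns per-layer next-token distributions, shape (num_layers, vocab_size).
    """
    return softmax(hidden_states @ unembed)

def commitment_depth(layer_probs, threshold=0.5):
    """Assumed definition: the earliest layer at which the token that wins at
    the final layer already holds at least `threshold` probability."""
    final_token = int(layer_probs[-1].argmax())
    trajectory = layer_probs[:, final_token]
    above = np.nonzero(trajectory >= threshold)[0]
    depth = int(above[0]) if above.size else len(layer_probs)
    return final_token, depth

# Toy setup: 6 layers, d_model = vocab_size = 4, identity unembedding.
unembed = np.eye(4)

# A token that commits early: its logit rises sharply at layer 1.
early = np.zeros((6, 4))
early[:, 2] = [0, 3, 4, 5, 5, 5]

# A token that commits late: its logit only rises in the last layers.
late = np.zeros((6, 4))
late[:, 1] = [0, 0, 0, 0, 1, 5]

early_result = commitment_depth(logit_lens(early, unembed))
late_result = commitment_depth(logit_lens(late, unembed))
print(early_result)  # (2, 1): token 2 crosses the threshold at layer 1
print(late_result)   # (1, 5): token 1 crosses the threshold at layer 5
```

In this toy setting, the early-committing token plays the role of a truthful token and the late-committing one the role of a hallucinatory token. With a real LVLM, the per-layer hidden states would come from the decoder itself rather than from hand-built arrays.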