CV, AI | Jun 4, 2024

CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models

arXiv:2406.01920v1 | 45 citations
Originality: Incremental advance
AI Analysis

This work addresses erroneous responses in LMMs, which matters to users who rely on accurate visual understanding. It is an incremental improvement: it builds on existing decoding strategies and requires no new training.

The paper tackles hallucinations in Large Multi-modal Models (LMMs) by introducing CODE, a contrastive decoding method that uses the model's own self-generated descriptions to improve response alignment with visual content. The authors report significant reductions in hallucinations and improved cross-modal consistency across benchmarks.

Large Multi-modal Models (LMMs) have recently demonstrated remarkable abilities in visual context understanding and coherent response generation. However, alongside these advancements, the issue of hallucinations has emerged as a significant challenge, producing erroneous responses that are unrelated to the visual contents. In this paper, we introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE), which leverages self-generated descriptions as contrasting references during the decoding phase of LMMs to address hallucination issues. CODE utilizes the comprehensive descriptions from the model itself as a visual counterpart to correct and improve response alignment with actual visual content. By dynamically adjusting the information flow and distribution of next-token predictions in the LMM's vocabulary, CODE enhances the coherence and informativeness of generated responses. Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs. Our method provides a simple yet effective decoding strategy that can be integrated into existing LMM frameworks without additional training.
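
To make the decoding idea concrete, here is a minimal sketch of a generic contrastive decoding step. It assumes two next-token logit vectors are already available: one conditioned on the image and one conditioned on the model's self-generated description in place of the image. The abstract does not give CODE's exact formulation, so the scoring rule, the plausibility cutoff, and the names contrastive_next_token, alpha, and beta are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def contrastive_next_token(logits_image, logits_description, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting two conditional distributions.

    logits_image:       logits when the model sees the actual image.
    logits_description: logits when the image is replaced by the model's
                        own description (the contrasting reference).
    alpha: contrast strength (hypothetical parameter).
    beta:  plausibility cutoff relative to the most likely token
           under the image-conditioned distribution (hypothetical parameter).
    """
    logp_img = log_softmax(np.asarray(logits_image, dtype=float))
    logp_desc = log_softmax(np.asarray(logits_description, dtype=float))

    # Keep only tokens that are reasonably likely given the real image,
    # so the contrast cannot promote tokens the model finds implausible.
    cutoff = np.log(beta) + np.max(logp_img)
    plausible = logp_img >= cutoff

    # Amplify what the image supports and penalize what the
    # description-only branch favors.
    scores = (1.0 + alpha) * logp_img - alpha * logp_desc
    scores = np.where(plausible, scores, -np.inf)
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    token, scores = contrastive_next_token([2.0, 1.9, -3.0], [2.0, 0.5, 3.0])
    print(token, scores)
```

In this toy input, greedy decoding on the image-conditioned logits alone would choose token 0, while the contrast shifts the choice to token 1, which the description-only branch supports less. Token 2 is excluded by the plausibility cutoff regardless of its contrast score. A method that adjusts the contrast dynamically, as the abstract describes, would presumably set the equivalents of alpha and beta per step rather than fixing them.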

