Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding
This work addresses a critical issue for MLLM users by reducing inaccuracies in generated content, though it appears incremental because it builds on existing contrastive decoding techniques.
The paper tackles hallucinations in Multimodal Large Language Models (MLLMs), where outputs are inconsistent with the input image. It proposes Layer Contrastive Decoding (LayerCD), which filters out hallucinations by contrasting the output distributions obtained from shallow-layer and deep-layer visual features of the vision encoder. The authors report that LayerCD significantly outperforms state-of-the-art methods on two hallucination benchmarks.
Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations -- generating outputs that are linguistically coherent but inconsistent with the context of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate, as they capture only biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms the current state of the art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD.
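To make the contrast step concrete, here is a minimal sketch of one decoding step. The abstract does not give the exact formula, so this assumes the form commonly used in contrastive decoding: amplify the deep-feature log-probabilities, subtract the shallow-feature ones, and discard tokens the deep branch already finds implausible. The function name `layer_contrastive_scores`, the parameters `alpha` and `beta`, and the toy logits are all hypothetical and are not taken from the LayerCD code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def layer_contrastive_scores(logits_deep, logits_shallow, alpha=1.0, beta=0.1):
    """Score next-token candidates by contrasting two output distributions.

    logits_deep:    logits when the MLLM sees deep-layer visual features.
    logits_shallow: logits when it sees shallow-layer visual features.
    alpha:          contrast strength (assumed hyperparameter).
    beta:           plausibility cutoff relative to the best deep-branch
                    token (assumed hyperparameter).
    """
    logp_deep = log_softmax(logits_deep)
    logp_shallow = log_softmax(logits_shallow)
    # Keep only tokens with p_deep >= beta * max(p_deep), so the subtraction
    # cannot promote tokens the deep branch considers implausible.
    cutoff = math.log(beta) + max(logp_deep)
    scores = []
    for d, s in zip(logp_deep, logp_shallow):
        if d < cutoff:
            scores.append(float("-inf"))
        else:
            scores.append((1.0 + alpha) * d - alpha * s)
    return scores


# Toy vocabulary of three tokens. The deep branch slightly prefers token 1,
# but the shallow branch prefers it strongly, which this sketch treats as a
# sign that token 1 is driven by biased low-level cues.
logits_deep = [1.9, 2.0, -3.0]
logits_shallow = [0.5, 2.5, 0.0]

scores = layer_contrastive_scores(logits_deep, logits_shallow)
greedy_deep = logits_deep.index(max(logits_deep))
greedy_contrast = scores.index(max(scores))
```

In this toy case, greedy decoding on the deep branch alone selects token 1, while the contrasted scores select token 0, and token 2 is masked out by the plausibility cutoff. In a real pipeline the two logit vectors would come from two forward passes of the same MLLM, each fed visual features from a different depth of the vision encoder.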