CV · Jul 10, 2025

Energy-Guided Decoding for Object Hallucination Mitigation

arXiv:2507.07731v16.21 citations · h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses object hallucination, a critical obstacle to the safe deployment of large vision-language models, and represents an incremental improvement over existing methods.

The paper tackles object hallucination in large vision-language models by proposing an energy-based decoding method that dynamically selects hidden states from the layer with the minimal energy score. Across three VQA benchmarks (POPE, MME, and MMVP), this yields an average accuracy improvement of 4.82% over greedy decoding and an average yes-ratio gap reduction of 8.81%.
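The layer-selection step described above can be sketched as follows. The summary does not define the energy score, so this sketch assumes the common free-energy form E = -log sum_k exp(z_k) over the logits z obtained by projecting each layer's hidden state through a shared output head. The function names (`energy`, `select_min_energy_layer`, `decode_token`) and the toy inputs are hypothetical illustrations, not the authors' implementation.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def energy(logits):
    # ASSUMPTION: free-energy form E = -logsumexp(logits); the summary
    # itself does not spell out how the energy score is computed.
    return -logsumexp(logits)


def project(hidden, head):
    """Map a hidden state to vocabulary logits with a (vocab x dim) head."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]


def select_min_energy_layer(layer_hiddens, head):
    """Return (layer_index, energy, logits) for the lowest-energy layer."""
    best = None
    for idx, hidden in enumerate(layer_hiddens):
        logits = project(hidden, head)
        e = energy(logits)
        if best is None or e < best[1]:
            best = (idx, e, logits)
    return best


def decode_token(layer_hiddens, head):
    """Pick the next token from the logits of the minimal-energy layer."""
    idx, _, logits = select_min_energy_layer(layer_hiddens, head)
    token = max(range(len(logits)), key=logits.__getitem__)
    return idx, token


# Toy example: three layers, hidden size 2, vocabulary of 2 tokens.
head = [[1.0, 0.0], [0.0, 1.0]]
layer_hiddens = [[0.1, 0.0], [3.0, 0.0], [0.0, 1.0]]
chosen_layer, chosen_token = decode_token(layer_hiddens, head)
```

In a real LVLM the per-layer hidden states would come from the model's forward pass and the head would be its language-modeling projection; the selection logic would stay the same under the stated assumption.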

Mitigating object hallucination in large vision-language models (LVLMs) is critical to their safe deployment. Existing methods are restricted to specific decoding methods, demand sophisticated modifications to visual inputs, or rely on knowledge from external models. In this work, we first reveal the phenomenon that VLMs exhibit significant imbalance in the "Yes" ratio (i.e., the fraction of "Yes" answers among the total number of questions) across three different visual question answering (VQA) datasets. Furthermore, we propose an energy-based decoding method, which dynamically selects the hidden states from the layer with minimal energy score. It is simple yet effective in reducing the bias in the yes ratio while boosting performance across three benchmarks (POPE, MME, and MMVP). Our method consistently improves accuracy and F1 score on three VQA datasets across three commonly used VLMs over several baseline methods. The average accuracy improvement is 4.82% compared to greedy decoding. Moreover, the average yes-ratio gap reduction is 8.81%, meaning the proposed method is less biased, as shown in Figure 1.
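To make the yes-ratio metric concrete, here is a minimal sketch of how it could be computed. The yes ratio follows the definition in the abstract; the "gap" is assumed here to be the absolute difference between the model's yes ratio and that of the ground-truth answers, which the abstract does not state explicitly. The helper names `yes_ratio` and `yes_ratio_gap` and the sample answers are hypothetical.

```python
def yes_ratio(answers):
    """Fraction of answers that are 'Yes' (case-insensitive prefix match)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    yes_count = sum(1 for a in answers if a.strip().lower().startswith("yes"))
    return yes_count / len(answers)


def yes_ratio_gap(predictions, ground_truth):
    # ASSUMPTION: gap = |model yes ratio - ground-truth yes ratio|.
    return abs(yes_ratio(predictions) - yes_ratio(ground_truth))


# Toy example: a model that over-answers "Yes" on a balanced question set.
predictions = ["Yes", "yes", "Yes", "No"]
ground_truth = ["yes", "no", "yes", "no"]
gap = yes_ratio_gap(predictions, ground_truth)
```

Under this reading, a smaller gap means the model's tendency to answer "Yes" is closer to the true answer distribution of the benchmark.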
