SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
SHIELD addresses inaccurate object descriptions in LVLMs, a problem for users who rely on cross-modal tasks. Its novelty lies in targeting the visual encoder rather than the LLM component.
The paper tackles object hallucination in Large Vision-Language Models by tracing it to the visual encoder, and proposes SHIELD, a training-free framework that reduces hallucinations across benchmarks while also achieving strong performance on general LVLM tasks.
Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code will be released.
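To make the contrastive decoding step more concrete, here is a minimal sketch on toy logits. The abstract does not state the decoding formula, so this is not SHIELD's implementation. It assumes the form commonly used in contrastive decoding work: amplify the token log-probabilities obtained from the clean image and subtract those obtained from the adversarially perturbed image, restricted to tokens the clean pass already finds plausible. The function names and the parameters `alpha` and `beta` are hypothetical, and the logits are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(clean_logits, perturbed_logits, alpha=1.0, beta=0.1):
    """Score next-token candidates by contrasting two forward passes.

    clean_logits:     logits conditioned on the original image
    perturbed_logits: logits conditioned on the perturbed image
    alpha:            contrast strength (assumed hyperparameter)
    beta:             plausibility threshold (assumed hyperparameter)
    """
    clean_lp = log_softmax(clean_logits)
    pert_lp = log_softmax(perturbed_logits)

    # Keep only tokens whose clean probability is at least
    # beta times the probability of the most likely clean token.
    cutoff = max(clean_lp) + math.log(beta)

    scores = []
    for c, p in zip(clean_lp, pert_lp):
        if c < cutoff:
            scores.append(float("-inf"))  # implausible token, masked out
        else:
            scores.append((1 + alpha) * c - alpha * p)
    return scores


def pick_token(scores):
    """Greedy choice: index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens. Token 0 is slightly preferred on the
# clean image but becomes much more likely once the image is perturbed,
# which is the pattern this sketch treats as a hallucination signal.
clean = [2.0, 1.9, -3.0]
perturbed = [2.5, 0.5, -3.0]

greedy_choice = pick_token(clean)
contrastive_choice = pick_token(contrastive_scores(clean, perturbed))
```

In this toy case, greedy decoding on the clean logits selects token 0, while the contrastive scores favor token 1, because token 0 gains probability under the perturbation. Token 2 is masked by the plausibility cutoff, so subtracting the perturbed pass cannot promote it.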