CV | AI | Nov 27, 2024

DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models

Tsinghua
arXiv:2411.18659v2 | 8 citations | h-index: 11 | Has Code | MM
Originality: Incremental advance
AI Analysis

This paper addresses the reliability and trustworthiness of large vision-language models (LVLMs), which is crucial for their safe deployment. The advance is incremental because it builds on existing attention mechanisms.

The paper tackles hallucination in LVLMs by developing a lightweight detector that identifies hallucinations from variations in cross-modal attention patterns. The authors report remarkable detection performance without additional LVLM training or extra LVLM inference steps.

Large vision-language models (LVLMs) have demonstrated exceptional performance on complex multimodal tasks. However, they continue to suffer from significant hallucination issues, including object, attribute, and relational hallucinations. To accurately detect these hallucinations, we investigated the variations in cross-modal attention patterns between hallucination and non-hallucination states. Leveraging these distinctions, we developed a lightweight detector capable of identifying hallucinations. Our proposed method, Detecting Hallucinations by Cross-modal Attention Patterns (DHCP), is straightforward and does not require additional LVLM training or extra LVLM inference steps. Experimental results show that DHCP achieves remarkable performance in hallucination detection. By offering novel insights into the identification and analysis of hallucinations in LVLMs, DHCP contributes to advancing the reliability and trustworthiness of these models. The code is available at https://github.com/btzyd/DHCP.
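The general idea of training a small classifier on cross-modal attention features can be sketched as below. This is a toy illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The tensor sizes, the helper names (`make_sample`, `attention_features`, `train_logistic`, `is_hallucination`), the synthetic data, and the choice of logistic regression as a stand-in for the lightweight detector are all hypothetical. In particular, the rule that hallucinated answers carry less attention mass on image tokens is invented here so that the toy data is separable; it is not a finding taken from the paper.

```python
import math
import random

N_LAYERS, N_HEADS, N_IMAGE_TOKENS = 2, 2, 3  # toy sizes, not a real model's


def make_sample(hallucinated, rng):
    """Build a synthetic [layer][head][image_token] attention tensor.

    ASSUMPTION (illustration only): hallucinated answers put less attention
    mass on image tokens. This rule is invented so the toy data is separable.
    """
    lo, hi = (0.1, 0.4) if hallucinated else (0.6, 0.9)
    tensor = []
    for _ in range(N_LAYERS):
        layer = []
        for _ in range(N_HEADS):
            mass = rng.uniform(lo, hi)
            layer.append([mass / N_IMAGE_TOKENS] * N_IMAGE_TOKENS)
        tensor.append(layer)
    return tensor


def attention_features(attn):
    """Flatten the tensor into one feature per (layer, head): image-token mass."""
    return [sum(head) for layer in attn for head in layer]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=200):
    """Fit a logistic-regression detector with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


def is_hallucination(w, b, x):
    """Return True when the detector's predicted probability exceeds 0.5."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5


rng = random.Random(0)  # fixed seed so the toy run is reproducible
labels = [i % 2 for i in range(100)]  # 1 = hallucinated, 0 = not
X = [attention_features(make_sample(bool(lab), rng)) for lab in labels]
w, b = train_logistic(X, labels)
accuracy = sum(
    is_hallucination(w, b, x) == bool(lab) for x, lab in zip(X, labels)
) / len(X)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

In a real setting the features would come from attention weights already produced while the LVLM generates its answer, which is why a detector of this kind needs no extra LVLM training or inference pass. The classifier itself still has to be fitted on labelled examples.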

