CV · CL · May 4, 2025

A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models

arXiv:2505.01958v1 · 5 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses reliability and safety issues in LVLMs that affect users of multimodal AI applications, but it is incremental because it builds on prior evaluation and mitigation efforts.

The paper tackles visual object hallucination in Large Vision-Language Models (LVLMs), where models generate inaccurate object-related information. It analyzes the components of LLaVA-like LVLMs (the language model, the vision backbone, and the projector) to identify error sources, proposes mitigation methods for each problematic component, and introduces two hallucination benchmarks: QA-VisualGenome and QA-FB15k.

Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks, but visual object hallucination remains a persistent issue. It refers to scenarios where models generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability. Previous works focus on the evaluation and mitigation of visual hallucinations, but the underlying causes have not been comprehensively investigated. In this paper, we analyze each component of LLaVA-like LVLMs -- the large language model, the vision backbone, and the projector -- to identify potential sources of error and their impact. Based on our observations, we propose methods to mitigate hallucination for each problematic component. Additionally, we developed two hallucination benchmarks: QA-VisualGenome, which emphasizes attribute and relation hallucinations, and QA-FB15k, which focuses on cognition-based hallucinations.
