CV | AI | Aug 3, 2025

Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models

arXiv:2508.01678v1 | 4 citations | h-index: 19 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses hallucination in VLMs, but its results are model-specific and the advance is incremental, since the method builds on existing architectures.

The paper tackles hallucination in Vision-Language Models by embedding textual instructions directly into images. It finds that this method raises Qwen2.5-VL's POPE accuracy by 4.1 percentage points (from 80.2% to 84.3%) but severely degrades the performance of LLaVA-1.5 and InstructBLIP.
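The reported gain is a difference between two accuracies, so it is easiest to read as percentage points rather than as a relative change. A minimal sketch of that arithmetic, using only the two POPE accuracies quoted on this page (the variable names are illustrative and not from the paper):

```python
# POPE accuracies for Qwen2.5-VL as quoted on this page (in percent).
baseline_acc = 80.2         # separate text prompt
prompt_in_image_acc = 84.3  # instruction embedded in the image

# Absolute gain: a difference of two percentages, i.e. percentage points.
absolute_gain_pp = round(prompt_in_image_acc - baseline_acc, 1)

# Relative gain: the same difference expressed as a share of the baseline.
relative_gain_pct = round(100 * (prompt_in_image_acc - baseline_acc) / baseline_acc, 1)

print(f"absolute gain: {absolute_gain_pp} percentage points")
print(f"relative gain: {relative_gain_pct}% of the baseline accuracy")
```

The two figures differ (about 4.1 points versus about 5.1% relative), which is why the unit matters when comparing against other reported improvements.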

Vision-Language Models (VLMs) often suffer from hallucination, partly due to challenges in aligning multimodal information. We propose Prompt-in-Image, a simple method that embeds textual instructions directly into images. This removes the need for separate text inputs and forces the model to process all content through the visual channel. We evaluate this method on three popular open-source VLMs: Qwen2.5-VL, LLaVA-1.5, and InstructBLIP. The results reveal sharp differences. Prompt-in-Image improves Qwen2.5-VL's performance, increasing POPE accuracy by 4.1 percentage points (from 80.2 percent to 84.3 percent) and also reducing hallucination rates on MS-COCO. In contrast, LLaVA-1.5 and InstructBLIP experience a severe performance drop, with accuracy falling from around 84 percent to near-random levels. Through detailed analysis, we find that the CLIP-based encoders in LLaVA and InstructBLIP exhibit excessive attention bias toward embedded text regions, which disrupts visual understanding. Qwen's vision encoder, by comparison, handles text-embedded images robustly. Crucially, Prompt-in-Image reduces Qwen's modality gap, enhancing cross-modal alignment by unifying information processing through a single modality.
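To make the idea of embedding an instruction in the image concrete, here is a minimal sketch using Pillow. It is not the authors' implementation: the abstract does not say where or how the text is rendered, so the white band below the picture, the default font, the fixed line height, and the helper name `embed_prompt` are all assumptions made for illustration.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def embed_prompt(image, prompt, chars_per_line=60, line_height=14, margin=8):
    """Return a copy of `image` with `prompt` drawn in a white band underneath.

    Hypothetical helper: layout, font, and sizes are illustrative assumptions,
    not details taken from the paper.
    """
    image = image.convert("RGB")
    font = ImageFont.load_default()

    # Wrap the instruction so long prompts span several lines.
    lines = textwrap.wrap(prompt, width=chars_per_line) or [""]
    band_height = line_height * len(lines) + 2 * margin

    # New canvas: original picture on top, white text band at the bottom.
    canvas = Image.new("RGB", (image.width, image.height + band_height), "white")
    canvas.paste(image, (0, 0))

    draw = ImageDraw.Draw(canvas)
    y = image.height + margin
    for line in lines:
        draw.text((margin, y), line, fill="black", font=font)
        y += line_height
    return canvas


# Usage: a solid-colour stand-in for a real photo, plus a POPE-style question.
photo = Image.new("RGB", (320, 200), (10, 120, 200))
combined = embed_prompt(photo, "Is there a dog in the image? Answer yes or no.")
```

The combined image would then be passed to the model as its only input. Placing the text outside the picture keeps the original pixels untouched; overlaying it on the picture is an equally plausible reading of the method.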

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
