CV, AI | Dec 3, 2025

DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation

Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang

arXiv:2512.03992v1 | 3.6 | h-index: 2 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a critical failure mode for VLMs in safety-critical applications such as autonomous driving, where transient visual corruption can induce hallucinations that persist across later frames. The advance is incremental: the paper benchmarks the problem rather than solving the underlying issue.

The paper evaluates vision-language models (VLMs) under temporal visual degradation, such as motion blur and noise, by introducing the DIQ-H benchmark. The benchmark reveals substantial robustness gaps: advanced models such as GPT-4o reach only a 78.5% recovery rate, and open-source models score below 60% on temporal consistency.

Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
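The abstract names three measured quantities (hallucination persistence, error recovery, and temporal consistency) without giving their formulas. The following is a minimal sketch of how such per-sequence metrics could be computed from multi-turn answers. The function names and the metric definitions are assumptions made for illustration and are not taken from the paper or its code.

```python
from typing import List, Optional, Tuple


def temporal_consistency(answers: List[str]) -> float:
    """Fraction of adjacent frame pairs with identical answers.

    Assumed definition, for illustration only.
    """
    if len(answers) < 2:
        return 1.0
    agree = sum(a == b for a, b in zip(answers, answers[1:]))
    return agree / (len(answers) - 1)


def recovery_and_persistence(
    correct: List[bool], degraded: List[bool]
) -> Optional[Tuple[float, float]]:
    """Return (recovery_rate, persistence_rate) for one frame sequence.

    Assumed definition: a "recovery opportunity" is a clean frame that
    directly follows a degraded frame on which the model answered wrongly.
    The model recovers if it answers that clean frame correctly; otherwise
    the error is counted as persisting.
    Returns None when the sequence contains no such opportunity.
    """
    recovered = 0
    persisted = 0
    for t in range(1, len(correct)):
        prev_failed_under_degradation = degraded[t - 1] and not correct[t - 1]
        if prev_failed_under_degradation and not degraded[t]:
            if correct[t]:
                recovered += 1
            else:
                persisted += 1
    total = recovered + persisted
    if total == 0:
        return None
    return recovered / total, persisted / total


# Hypothetical sequence of eight frames with two degraded windows.
correct = [True, False, False, True, True, False, False, True]
degraded = [False, True, True, False, False, True, False, False]
rates = recovery_and_persistence(correct, degraded)
consistency = temporal_consistency(["car", "car", "truck", "car"])
```

In this toy sequence the model recovers after the first degraded window and keeps its error for one clean frame after the second, so both rates are 0.5. A real benchmark would aggregate such counts over many sequences and question types.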
