Haohan Zheng

h-index1
2papers

2 Papers

CVAug 4, 2025Code
Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens

Haohan Zheng, Zhenguo Zhang

Large vision-language models (LVLMs) have demonstrated remarkable multimodal comprehension and reasoning capabilities, but they still suffer from severe object hallucination. Previous studies primarily attribute the flaw to linguistic prior caused by the scale mismatch between visual encoders and large language models (LLMs) in LVLMs. Specifically, as current LVLMs are built upon LLMs, they tend to over-rely on textual prompts and internal knowledge of LLMs, generating descriptions inconsistent with visual cues. However, through an in-depth investigation of the hallucinated mechanisms, we empirically reveal a previously overlooked phenomenon: LVLMs may ignore not only visual information but also textual modality during hallucination, a behavior termed as modality bias, which indicates that LVLMs struggle to simultaneously attend to both visual and textual modalities, leading to fragmented understanding of user-provided instructions. Based on this observation, we propose a simple yet effective training-free method to mitigate object hallucination. Concretely, we intervene and adjust the attention weights of textual and visual tokens, balancing cross-modal compatibility for better alignment with user intentions. Furthermore, we adopt a contrastive decoding strategy to reduce the LVLM's overreliance on its parametric knowledge, synergistically enhancing our attention manipulation. Extensive experiments confirm the widespread presence of modality bias in LVLMs. Notably, our method effectively mitigates hallucination across multiple open-source LVLMs and benchmarks, highlighting its generalizability and efficacy.

CVDec 16, 2025
OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Zhenguo Zhang, Haohan Zheng, Yishen Wang et al.

The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is an Reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.