CV | Jul 12, 2025

MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models

Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang

arXiv:2507.09184v2 | 19.012 citations | h-index: 6 | Has Code | MM

Originality: Incremental advance

AI Analysis

The paper addresses hallucinations in LVLMs, a critical issue for reliable multimodal AI applications, but the contribution appears incremental because it modifies an existing positional encoding method.

The paper tackles hallucinations in Large Vision-Language Models (LVLMs) by addressing the image alignment bias caused by long-term decay in Rotary Position Encoding (RoPE). It proposes MCA-LLaVA, which uses Manhattan distance to extend this decay into a two-dimensional spatial decay, improving multimodal alignment and reducing hallucinations. Experimental results across various benchmarks show that the approach is effective.

Hallucinations pose a significant challenge in Large Vision-Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: they prioritize image tokens from the bottom-right region because, in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance the instruction tokens' perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code is available at https://github.com/ErikZ719/MCA-LLaVA.
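The positional argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the 24x24 grid size, the assumption that instruction tokens directly follow the image tokens, and all function names are hypothetical choices made for illustration. It shows only why a row-by-row flattening places the bottom-right image tokens closest to the instruction, and how Manhattan distance on the grid treats horizontal and vertical neighbors equally where the one-dimensional order does not.

```python
def raster_index(row, col, width):
    """1D position of an image token when the grid is flattened row by row."""
    return row * width + col


def sequence_distance_to_instruction(row, col, height, width):
    """1D distance from an image token to the first instruction token.

    Assumption: the instruction tokens start right after the last image token.
    """
    return height * width - raster_index(row, col, width)


def manhattan_distance(a, b):
    """Manhattan (L1) distance between two (row, col) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


HEIGHT, WIDTH = 24, 24  # hypothetical image-token grid size

# In the 1D sequence, the top-left token is the farthest from the instruction
# and the bottom-right token is the closest.
top_left = sequence_distance_to_instruction(0, 0, HEIGHT, WIDTH)
bottom_right = sequence_distance_to_instruction(HEIGHT - 1, WIDTH - 1, HEIGHT, WIDTH)

# Two spatial neighbors of the same token: one to the right, one below.
center, right, below = (12, 12), (12, 13), (13, 12)
gap_right_1d = raster_index(*right, WIDTH) - raster_index(*center, WIDTH)
gap_below_1d = raster_index(*below, WIDTH) - raster_index(*center, WIDTH)
gap_right_2d = manhattan_distance(center, right)
gap_below_2d = manhattan_distance(center, below)

print("1D distance to instruction:", top_left, "(top-left) vs", bottom_right, "(bottom-right)")
print("1D gap to right / lower neighbor:", gap_right_1d, "/", gap_below_1d)
print("Manhattan gap to right / lower neighbor:", gap_right_2d, "/", gap_below_2d)
```

Under these assumptions, a decay that depends only on the one-dimensional gap weights the lower neighbor as if it were a full row width away, while a Manhattan-based distance keeps both neighbors equally close.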

