CV · Mar 18, 2025

Identifying and Mitigating Position Bias of Multi-image Vision-Language Models

arXiv:2503.13792v1 · 23 citations · h-index: 5 · Has Code · CVPR
Originality: Incremental advance
AI Analysis

This paper addresses a specific issue in multi-image reasoning for vision-language models: sensitivity to image order. It offers an incremental improvement over existing methods.

The paper tackles the problem of position bias in multi-image vision-language models, where predictions are significantly affected by image order, and proposes a training-free method that reduces this bias and improves reasoning performance.

The evolution of Large Vision-Language Models (LVLMs) has progressed from single to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
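The interpolation idea behind SoFA can be illustrated with a small NumPy sketch. This is one possible reading of the abstract, not the authors' implementation: it assumes the interpolation is applied to the outputs of a causally masked attention and of an attention in which image tokens may additionally attend to tokens of other images, weighted by a coefficient `sigma`. The function names, the `image_ids` convention (-1 for text tokens), and `sigma` itself are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) get zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_masks(image_ids):
    """Return (causal, bidirectional) boolean masks of shape (n, n).

    image_ids[i] is the index of the image that token i belongs to,
    or -1 for a text token (a convention assumed for this sketch).
    """
    ids = np.asarray(image_ids)
    n = len(ids)
    causal = np.tril(np.ones((n, n), dtype=bool))
    is_img = ids >= 0
    # Pairs of image tokens that belong to different images.
    inter_image = is_img[:, None] & is_img[None, :] & (ids[:, None] != ids[None, :])
    # Same as causal, except image tokens may also see other images' tokens.
    bidirectional = causal | inter_image
    return causal, bidirectional

def masked_attention(q, k, v, mask):
    # Single-head scaled dot-product attention with a boolean mask.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def soft_attention(q, k, v, image_ids, sigma):
    # Assumed form: linear interpolation between the two attention outputs.
    causal, bidirectional = build_masks(image_ids)
    out_causal = masked_attention(q, k, v, causal)
    out_bidir = masked_attention(q, k, v, bidirectional)
    return (1.0 - sigma) * out_causal + sigma * out_bidir

# Toy sequence: two tokens of image 0, two tokens of image 1, one text token.
rng = np.random.default_rng(0)
image_ids = [0, 0, 1, 1, -1]
q, k, v = (rng.normal(size=(5, 4)) for _ in range(3))
mixed = soft_attention(q, k, v, image_ids, sigma=0.5)
```

In this sketch, `sigma = 0` recovers ordinary causal attention and `sigma = 1` gives fully bidirectional attention across images, while text tokens and tokens within the same image remain causal throughout. Because only the mask changes and no weights are updated, the approach stays training-free in the sense the abstract describes.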

