Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
FOCUS addresses a specific bottleneck in multi-image reasoning with LVLMs, offering a practical, incremental fix that requires no retraining or architectural changes.
The paper tackles cross-image information leakage in Large Vision-Language Models (LVLMs): on multi-image tasks, performance degrades because visual cues from different images become entangled in the output. It proposes FOCUS, a training-free decoding strategy that improves performance across four benchmarks and diverse LVLM families.
Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
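The decoding procedure described in the abstract can be sketched for a single decoding step as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the aggregation rule or the contrastive formula, so the sketch assumes a plain mean over the per-image logits and a contrast of the form (1 + alpha) * aggregated - alpha * reference, which is a form often used in contrastive decoding. The names `focus_logits`, `logits_fn`, `make_noise`, `alpha`, and `seed` are hypothetical, and images are represented as flat lists of floats purely for illustration.

```python
import random
from typing import Callable, List

Image = List[float]   # a flattened image; an illustrative stand-in only
Logits = List[float]  # next-token logits over a (toy) vocabulary


def make_noise(image: Image, rng: random.Random) -> Image:
    """Return Gaussian noise with the same length as the image."""
    return [rng.gauss(0.0, 1.0) for _ in image]


def focus_logits(
    images: List[Image],
    logits_fn: Callable[[List[Image]], Logits],
    alpha: float = 1.0,
    seed: int = 0,
) -> Logits:
    """Sketch of one FOCUS-style decoding step.

    logits_fn is a hypothetical wrapper around an LVLM forward pass that
    maps a list of images (the text prompt is held fixed) to logits.
    """
    rng = random.Random(seed)
    noise = [make_noise(img, rng) for img in images]

    # One pass per target image: keep image i clean, replace the rest with noise.
    per_image = []
    for i in range(len(images)):
        masked = [img if j == i else noise[j] for j, img in enumerate(images)]
        per_image.append(logits_fn(masked))

    # Assumption: aggregate the partially masked passes with a simple mean.
    n = len(per_image)
    aggregated = [sum(column) / n for column in zip(*per_image)]

    # Noise-only reference: every image replaced by noise.
    reference = logits_fn(noise)

    # Assumption: contrastive refinement against the noise-only reference.
    return [(1 + alpha) * a - alpha * r for a, r in zip(aggregated, reference)]
```

In this sketch, N input images cost N + 1 forward passes per decoding step (N partially masked passes plus one noise-only pass), which is the price of staying training-free. In practice `logits_fn` would wrap a real model call, and the noise would be applied in pixel or embedding space as the method specifies.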