CV · AI · Mar 4

Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

arXiv:2603.04676v1
Originality: Incremental advance
AI Analysis

This work addresses the problem of unfocused attention in multi-image reasoning for vision-language models, offering an incremental improvement for researchers working on multi-modal understanding.

This paper investigates multi-image reasoning in VLMs and identifies diffuse, unfocused text-to-image attention patterns during chain-of-thought generation, along with a positional bias in how attention is allocated across images. To address this, the authors propose PulseFocus, a training-free, inference-time method that structures CoT into interleaved plan/focus blocks with soft attention gating, reporting improvements of +3.7% on BLINK and +1.07% on MuirBench.

Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses", sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).
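The abstract does not spell out how the soft gate is computed, so the following is only a minimal sketch of one plausible reading: subtract a fixed penalty from the pre-softmax attention scores of tokens belonging to non-referenced images, then renormalize. The function names, the `penalty` parameter, and the convention of marking text tokens with `-1` are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def soft_gate_attention(logits, image_ids, focus_image, penalty=2.0):
    """Illustrative soft gate (assumed form, not the paper's exact rule).

    logits:      (num_tokens,) pre-softmax attention scores for one query token.
    image_ids:   (num_tokens,) image index of each key token, -1 for text tokens.
    focus_image: index of the image named in the current plan block.
    penalty:     hypothetical gate strength; 0 disables gating.
    """
    logits = np.asarray(logits, dtype=float)
    image_ids = np.asarray(image_ids)
    # Tokens that belong to an image other than the focused one.
    other = (image_ids >= 0) & (image_ids != focus_image)
    # Soft gating: down-weight rather than mask out, so other images stay reachable.
    gated = logits - penalty * other
    gated = gated - gated.max()  # numerical stability before exponentiating
    weights = np.exp(gated)
    return weights / weights.sum()

def image_attention_mass(weights, image_ids, num_images):
    """Total attention mass assigned to each image's tokens."""
    image_ids = np.asarray(image_ids)
    return np.array([weights[image_ids == i].sum() for i in range(num_images)])
```

With uniform scores over two text tokens and two tokens for each of two images, setting `focus_image=1` shifts attention mass toward image 1 while leaving a reduced but nonzero share on image 0, which is the "soft" part of the gate in this reading.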

