Stateful Visual Encoders for Vision-Language Models

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

arXiv:2606.04433

AI Analysis

This work addresses the limitation of stateless visual encoders in VLMs for multi-image agentic settings, offering a simple finetuning approach that improves performance on tasks requiring visual comparison across images.

The paper introduces a Stateful Visual Encoder for VLMs that conditions each image representation on prior visual context, enabling better cross-image reasoning. Under supervised finetuning, it achieves consistent improvements on tasks like spatial aggregation, visual differencing, and behavior cloning, and matches or surpasses specialized models in domains like radiology and remote sensing.

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/
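
The contrast between a stateless and a stateful encoder can be illustrated with a toy sketch. This is not the authors' architecture, which the abstract does not specify. It assumes, purely for illustration, that "conditioning on prior visual features" is done by letting each current image token attend over a memory of earlier tokens and adding the result through a fixed gate. The names `stateless_encode`, `StatefulEncoder`, `gate`, and `memory` are all hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def stateless_encode(tokens):
    # Hypothetical stand-in for a per-image encoder (identity here):
    # the output depends only on the current image's tokens.
    return [list(t) for t in tokens]


class StatefulEncoder:
    """Toy stateful encoder (an assumption, not the paper's design)."""

    def __init__(self, gate=0.5):
        self.gate = gate   # hypothetical mixing weight for prior context
        self.memory = []   # per-image features from earlier images

    def encode(self, tokens):
        feats = stateless_encode(tokens)
        if not self.memory:
            out = feats  # no prior context yet: same as stateless
        else:
            d = len(feats[0])
            scale = 1.0 / math.sqrt(d)
            out = []
            for q in feats:
                # Attention weights of the current token over prior features.
                w = softmax([dot(q, k) * scale for k in self.memory])
                ctx = [sum(wi * k[i] for wi, k in zip(w, self.memory))
                       for i in range(d)]
                # Residual update: current feature plus gated prior context.
                out.append([x + self.gate * c for x, c in zip(q, ctx)])
        # Store the unconditioned features for use by later images.
        self.memory.extend(feats)
        return out


img_a = [[1.0, 0.0], [0.0, 1.0]]
img_b = [[1.0, 0.1], [0.0, 1.0]]  # small change relative to img_a

enc = StatefulEncoder()
first = enc.encode(img_a)    # no history: matches the stateless output
second = enc.encode(img_b)   # now depends on img_a as well as img_b
```

In this sketch the representation of `img_b` depends on whether `img_a` was seen first, which is the property the abstract attributes to stateful encoders. A stateless encoder would return the same features for `img_b` regardless of history.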
