CV · AI · Oct 22, 2025

I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs

arXiv:2510.19678v1
h-index: 5
Originality: Incremental advance
AI Analysis

This work provides a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs, addressing the opacity of their visual processing, though it is incremental as it applies existing psychological methods to a new model type.

The researchers evaluate the visual processing mechanisms of multimodal large language models (MLLMs) by adapting classic visual search paradigms from cognitive psychology to test for human-like perceptual effects. They find that advanced MLLMs exhibit pop-out effects in colour- or size-based search and capacity limits in search over multiple features, with evidence that the models incorporate natural scene priors such as lighting direction.

Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms, originally developed to study human perception, to test whether MLLMs exhibit the "pop-out" effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
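
To make the paradigm concrete, the following is a minimal sketch of how disjunctive and conjunctive search trials might be specified, and how a pop-out effect could be quantified as a near-zero slope of accuracy against set size. It is not the authors' code: the function names `make_trial` and `search_slope`, the colour/shape feature set, and the choice of accuracy as the dependent measure are all assumptions made for illustration.

```python
import random

# Hypothetical target; the paper's actual stimuli are not specified here.
TARGET = ("red", "circle")


def make_trial(set_size, mode, rng):
    """Return a list of (colour, shape) items; item 0 is the target."""
    if mode == "disjunctive":
        # Target differs from every distractor in a single feature (colour).
        pool = [("green", "circle")]
    elif mode == "conjunctive":
        # Each distractor shares exactly one feature with the target, so the
        # target is defined only by the combination of colour and shape.
        pool = [("red", "square"), ("green", "circle")]
    else:
        raise ValueError(f"unknown mode: {mode}")
    distractors = [rng.choice(pool) for _ in range(set_size - 1)]
    return [TARGET] + distractors


def search_slope(set_sizes, accuracies):
    """Least-squares slope of accuracy against distractor set size."""
    n = len(set_sizes)
    mean_x = sum(set_sizes) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(set_sizes, accuracies))
    return sxy / sxx
```

Under these assumptions, a slope close to zero across set sizes would be read as pop-out, while a clearly negative slope (accuracy falling as distractors are added) would suggest a capacity limit of the kind the abstract reports for conjunctive search.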

