Do multimodal models imagine electric sheep?

Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun

arXiv:2605.0969396.2

AI Analysis

For researchers in multimodal AI, this work provides evidence that internal visual representations emerge from action prediction alone, and it offers a way to improve reasoning by integrating a small number of visual tokens into the chain of thought.

Large multimodal models develop mental imagery as a byproduct of learning to solve spatial puzzles, even without explicit visual supervision. Integrating as few as 16 visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.

Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.
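The abstract does not say how visual information is read out of the activations. One common way to test a claim of this kind is a linear probe, sketched below purely as an illustration. Everything here is an assumption rather than the authors' method: the data are synthetic stand-ins for per-step hidden states and intermediate-state image features, and the names `fit_ridge_probe` and `r2_score` are hypothetical helpers.

```python
import numpy as np

def fit_ridge_probe(H, Y, lam=1e-2):
    """Closed-form ridge regression: W = (H^T H + lam * I)^-1 H^T Y."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ Y)

def r2_score(Y_true, Y_pred):
    """Coefficient of determination pooled over all target dimensions."""
    ss_res = ((Y_true - Y_pred) ** 2).sum()
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d_hidden, d_image = 400, 64, 16

# Synthetic stand-ins (assumption): H plays the role of activations taken
# after each action, Y the role of features of the true intermediate state.
H = rng.normal(size=(n, d_hidden))
true_map = rng.normal(size=(d_hidden, d_image))
Y = H @ true_map + 0.1 * rng.normal(size=(n, d_image))

# Fit on a training split, evaluate on held-out examples.
H_train, H_test = H[:300], H[300:]
Y_train, Y_test = Y[:300], Y[300:]
W = fit_ridge_probe(H_train, Y_train)
probe_r2 = r2_score(Y_test, H_test @ W)

# Control: a probe fit on shuffled targets should not generalize.
perm = rng.permutation(len(H_train))
W_ctrl = fit_ridge_probe(H_train, Y_train[perm])
control_r2 = r2_score(Y_test, H_test @ W_ctrl)

print(f"probe R^2: {probe_r2:.3f}, shuffled-control R^2: {control_r2:.3f}")
```

A large gap between the probe score and the shuffled control on held-out data is what would suggest that the activations carry information about the intermediate state, rather than the probe memorizing its training set.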
