CV · AI · Jun 20, 2025

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

arXiv:2506.17218v1 · 78 citations · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the problem of enhancing visual imagination in AI systems. It is aimed at researchers and developers working on multimodal AI and represents an incremental improvement over existing methods.

The paper tackles the limitation of vision-language models in visual reasoning by introducing a framework that uses latent visual tokens instead of explicit images, resulting in improved performance on multimodal tasks without the overhead of image generation.

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders their reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We first supervise the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
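
The decoding idea in the abstract (feeding hidden states back as the next "tokens" instead of rendering an image) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the Mirage implementation: the tiny recurrent "decoder", the random weights, the `<latent>` trigger token, the fixed span length `NUM_LATENT_STEPS`, and the `pick_token` hook are all hypothetical names introduced only for illustration.

```python
import math
import random

DIM = 8
VOCAB = ["<latent>", "the", "cube", "rotates", "left", "<eos>"]
LATENT_ID = 0          # hypothetical token that switches to "visual thinking"
EOS_ID = 5
NUM_LATENT_STEPS = 3   # hypothetical fixed length of a latent span

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

EMBED = rand_matrix(len(VOCAB), DIM)   # token id -> input embedding
W_STATE = rand_matrix(DIM, DIM)        # toy recurrent weights (stand-in for a VLM)
W_INPUT = rand_matrix(DIM, DIM)
W_OUT = rand_matrix(len(VOCAB), DIM)   # hidden state -> vocabulary logits

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def step(state, x):
    """One toy decoder step: next hidden state from previous state and input."""
    mixed = zip(matvec(W_STATE, state), matvec(W_INPUT, x))
    return [math.tanh(a + b) for a, b in mixed]

def argmax(logits):
    return max(range(len(logits)), key=logits.__getitem__)

def decode(prompt_ids, max_steps=12, pick_token=argmax):
    """Return an interleaved trajectory of ("text", str) and ("latent", vector)."""
    state = [0.0] * DIM
    for token in prompt_ids:
        state = step(state, EMBED[token])

    trajectory = []
    latent_left = 0
    for _ in range(max_steps):
        if latent_left > 0:
            # Latent mode: the hidden state itself is the next input.
            # No vocabulary lookup and no pixel-level image is produced.
            trajectory.append(("latent", list(state)))
            state = step(state, state)
            latent_left -= 1
            continue
        token = pick_token(matvec(W_OUT, state))
        trajectory.append(("text", VOCAB[token]))
        if token == EOS_ID:
            break
        if token == LATENT_ID:
            latent_left = NUM_LATENT_STEPS
        state = step(state, EMBED[token])
    return trajectory
```

Passing a scripted `pick_token` (for example, one that returns `LATENT_ID` first) makes the mode switch easy to observe: the trajectory then contains a `<latent>` text entry followed by `NUM_LATENT_STEPS` latent vectors before text decoding resumes. In a real system the latent span length, the trigger, and the training signal for those vectors would follow the paper's own design, which the abstract does not specify in detail.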

Code Implementations: 1 repo