CV · CL · May 12

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

arXiv:2605.11856 · 97.6
Predicted impact: top 3% in CV · last 90 days
Originality: Highly original
AI Analysis

For multimodal LLMs, this work proposes a more efficient and unified reasoning paradigm that eliminates the need for interleaved text and vision channels.

UniVLR unifies textual reasoning and visual evidence into a shared visual workspace, using compact visual latent tokens to replace explicit chain-of-thought. It outperforms prior visual latent reasoning methods while generating substantially fewer reasoning tokens.

Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.
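
The pipeline the abstract describes (render the reasoning trace next to the auxiliary image, then compress the combined canvas into a small, fixed number of latents) can be illustrated with a toy sketch. This is not the authors' implementation: the names `render_text`, `build_workspace`, and `compress` are hypothetical, the "renderer" maps one character to one pixel instead of rasterizing glyphs, and fixed mean pooling stands in for the learned compressor. It only shows that the latent budget stays constant however long the rendered trace is.

```python
from typing import List

Image = List[List[float]]  # grayscale image: rows of floats in [0, 1]


def render_text(trace: str, width: int) -> Image:
    """Toy renderer (assumption): one pixel per character, wrapped to `width`.

    A real system would rasterize glyphs; here each character's code point
    is scaled into [0, 1] purely for illustration.
    """
    pixels = [(ord(ch) % 256) / 255.0 for ch in trace]
    pixels += [0.0] * ((-len(pixels)) % width)  # pad the last row
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def build_workspace(trace: str, aux: Image) -> Image:
    """Stack the rendered trace above the auxiliary image (same width)."""
    width = len(aux[0])
    return render_text(trace, width) + aux


def compress(workspace: Image, num_latents: int) -> List[float]:
    """Stand-in for the learned compressor (assumption): mean-pool
    contiguous pixel chunks into a fixed number of scalar latents."""
    flat = [p for row in workspace for p in row]
    chunk = -(-len(flat) // num_latents)  # ceiling division
    latents = []
    for i in range(num_latents):
        part = flat[i * chunk:(i + 1) * chunk]
        latents.append(sum(part) / len(part) if part else 0.0)
    return latents


# A 2x4 auxiliary "image" and two traces of very different lengths.
aux = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
short = compress(build_workspace("2+2=4", aux), num_latents=4)
long = compress(build_workspace("step: add the digits. " * 20, aux), num_latents=4)
print(len(short), len(long))  # prints: 4 4
```

In this sketch the number of latents is a free parameter chosen up front, which is the property the abstract credits for the reduction in generated reasoning tokens; how UniVLR actually sets and learns them is not specified here.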

