Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
For researchers working on multimodal reasoning, this work addresses an overlooked optimization issue in latent-space reasoning, offering a training-free method to enhance reasoning capacity.
The paper identifies a pathology in multimodal LLMs where visual latent tokens become semantically enriched during training but are suppressed in final predictions due to autoregressive shortcuts. It proposes inference-time optimization via contrastive alignment and a confidence-progression reward, achieving consistent improvements across eight benchmarks and four backbones without parameter updates.
Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I, visual latents are warmed up via query-guided contrastive latent-visual alignment, improving semantic quality while preventing latent collapse. In Stage II, the latent reasoning is further optimized via a confidence-progression reward, which incentivizes predicted token distributions along the latent span to become progressively more concentrated, routing predictions through the latent reasoning rather than bypassing it. Experiments across eight benchmarks and four model backbones show that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.
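The two stages can be illustrated with a minimal NumPy sketch. The abstract does not give the exact objectives, so everything here is an assumption made for illustration: Stage I is approximated by a standard InfoNCE loss between latents and matched visual features, and Stage II by an entropy-based reward that is positive when the predicted distributions sharpen from one latent position to the next. The function names, the `temperature` and `penalty` parameters, and the asymmetric weighting are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def info_nce(latents, visual_feats, temperature=0.07):
    """Assumed stand-in for Stage I: InfoNCE over matched rows.

    latents, visual_feats: arrays of shape (n, d); row i of each is
    treated as a positive pair, all other rows as negatives.
    """
    a = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    b = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Row-wise log-softmax, computed stably.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def entropy(probs, eps=1e-12):
    """Shannon entropy (nats) along the last axis."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)


def confidence_progression_reward(step_logits, penalty=2.0):
    """Assumed stand-in for Stage II.

    step_logits: array of shape (T, V), the vocabulary logits predicted
    at each of T latent positions. Each step-to-step entropy decrease is
    rewarded; increases are penalised more heavily (hypothetical choice)
    so that steady sharpening scores higher than a late jump.
    """
    h = entropy(softmax(step_logits))
    deltas = h[:-1] - h[1:]  # positive when the distribution sharpens
    weighted = np.where(deltas >= 0, deltas, penalty * deltas)
    return float(weighted.mean())
```

In an inference-time setting, quantities like these would be differentiated with respect to the latent tokens only, with the backbone frozen; that would require an autodiff framework and is left out of this sketch.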