CV · May 12

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

arXiv:2605.12163 · 99.1 · Has Code
Predicted impact: top 2% in CV · last 90 days · Originality: Highly original
AI Analysis

For vision-language model reasoning, this work addresses the bottleneck of scaling latent reasoning chains, enabling longer and more effective reasoning.

The paper finds that existing latent visual reasoning methods degrade as latent sequences grow longer, and attributes this to Information Gain Collapse. It proposes SCOLAR, which uses a detransformer to generate auxiliary visual tokens in a single shot, extending acceptable latent CoT length by over 30× and achieving +14.12% over the backbone on real-world reasoning benchmarks.

In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We trace the root cause to Information Gain Collapse: autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further find that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art results among open-source models on real-world reasoning benchmarks (+14.12% over the backbone), and demonstrates strong out-of-distribution generalization.
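
To make the contrast with autoregressive generation concrete, the following is a minimal sketch of single-shot latent generation. It is not the paper's detransformer: it assumes a single cross-attention step in which hypothetical learned queries attend over the full sequence of hidden states, and the names `single_shot_latents`, `hidden`, and `queries` are illustrative only. The point it shows is that every output token is computed from the hidden states alone, so no output token depends on another output token.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_shot_latents(hidden, queries):
    """Produce one latent token per query in a single pass.

    hidden:  list of T vectors (stand-in for full-sequence hidden states)
    queries: list of K vectors (hypothetical learned queries, an assumption)

    Each output is an attention-weighted average of `hidden`. No output
    is fed back as input, unlike autoregressive decoding.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every position.
        scores = [
            sum(qi * hi for qi, hi in zip(q, h)) / math.sqrt(d)
            for h in hidden
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * h[j] for w, h in zip(weights, hidden)) for j in range(d)]
        )
    return outputs


random.seed(0)
T, K, d = 16, 4, 8  # toy sizes, chosen only for illustration
hidden = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]

latents = single_shot_latents(hidden, queries)

# Perturb only the first query: the remaining tokens are unaffected,
# because each token is derived independently from the hidden states.
perturbed = [list(q) for q in queries]
perturbed[0] = [v + 1.0 for v in perturbed[0]]
latents_perturbed = single_shot_latents(hidden, perturbed)
```

In an autoregressive decoder, by contrast, changing the first generated token would propagate to every later token, which is the dependency the abstract identifies as the source of Information Gain Collapse.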
