Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention
For LLM practitioners, this work provides a method to improve the reliability and controllability of latent reasoning without retraining, addressing a key bottleneck: the opacity of continuous thought vectors.
This paper reveals that latent reasoning vectors in LLMs encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Operationalizing these insights as training-free, decode-time interventions consistently improves reasoning accuracy across multiple model scales and task domains without parameter updates.
Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.
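The abstract does not spell out the individual interventions, so the following is only a toy sketch of the general shape a training-free, decode-time intervention could take. Everything in it is an assumption made for illustration: the recurrent map `W` stands in for a frozen model, and `latent_rollout`, `rescale_to_norm`, `target_norm`, and `first_k` are hypothetical names, not the paper's method. The sketch edits only the early latent vectors, echoing the finding that they act as causal hubs, and leaves the weights untouched.

```python
import numpy as np

def latent_rollout(h0, W, num_steps, intervene=None):
    """Toy latent-reasoning loop: each hidden state is fed back as the next input."""
    h = h0
    trajectory = []
    for step in range(num_steps):
        h = np.tanh(W @ h)          # stand-in for one forward pass of a frozen model
        if intervene is not None:
            h = intervene(h, step)  # decode-time edit of the latent; W is never updated
        trajectory.append(h)
    return trajectory

def rescale_to_norm(target_norm, first_k):
    """Hypothetical geometric prior: pin the first `first_k` latents to a fixed norm."""
    def hook(h, step):
        if step >= first_k:
            return h                # later latent vectors pass through unchanged
        norm = np.linalg.norm(h)
        return h if norm == 0.0 else h * (target_norm / norm)
    return hook

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(scale=0.5, size=(dim, dim))  # illustrative frozen weights
W_before = W.copy()
h0 = rng.normal(size=dim)

baseline = latent_rollout(h0, W, num_steps=4)
steered = latent_rollout(h0, W, num_steps=4,
                         intervene=rescale_to_norm(1.0, first_k=2))
```

Here `baseline` and `steered` share the same weights and starting state; they differ only because the hook rewrites the first two latent vectors during the rollout, which is the sense in which such an intervention needs no parameter updates.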