AI | LG | Feb 23

Latent Introspection: Models Can Detect Prior Concept Injections

arXiv:2602.20031v1 | 7 citations | h-index: 3
Originality: Incremental advance
AI Analysis

The paper reveals a latent capacity for introspection in an AI model, with implications for understanding latent reasoning and for safety, though it is an incremental advance in the study of model awareness.

The study found that a Qwen 32B model can detect when concepts were previously injected into its context. Sensitivity increased from 0.3% to 39.2% when the model was prompted with information about introspection mechanisms, while false positives rose by only 0.6%.

We uncover a latent capacity for introspection in a Qwen 32B model, demonstrating that the model can detect when concepts have been injected into its earlier context and identify which concept was injected. While the model denies injection in sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Furthermore, prompting the model with accurate information about AI introspection mechanisms can dramatically strengthen this effect: the sensitivity to injection increases massively (0.3% -> 39.2%) with only a 0.6% increase in false positives. Also, mutual information between nine injected and recovered concepts rises from 0.62 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate models can have a surprising capacity for introspection and steering awareness that is easy to overlook, with consequences for latent reasoning and safety.
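To make the mutual-information figures concrete, here is a minimal sketch of how mutual information in bits can be computed from a table of injected versus recovered concepts. The function name `mutual_information_bits` and the toy count tables are illustrative assumptions, not taken from the paper, and the paper's own estimator may differ. For scale, if the nine concepts were injected equally often and recovered perfectly, the mutual information would be log2(9), about 3.17 bits.

```python
import math


def mutual_information_bits(counts):
    """Mutual information (in bits) of a joint count table.

    counts[i][j] is the number of trials in which concept i was injected
    and concept j was recovered. Hypothetical helper for illustration only.
    """
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("count table is empty")
    row_sums = [sum(row) for row in counts]        # marginal over injected
    col_sums = [sum(col) for col in zip(*counts)]  # marginal over recovered
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n == 0:
                continue  # a zero cell contributes nothing (0 * log 0 := 0)
            # p(i, j) / (p(i) * p(j)) simplifies to n * total / (row * col)
            ratio = n * total / (row_sums[i] * col_sums[j])
            mi += (n / total) * math.log2(ratio)
    return mi


# Toy data, made up for illustration (not results from the paper).
K = 9
perfect = [[10 if i == j else 0 for j in range(K)] for i in range(K)]
uniform = [[1] * K for _ in range(K)]

print(mutual_information_bits(perfect))  # perfect recovery: log2(9) bits
print(mutual_information_bits(uniform))  # recovery independent of injection
```

A table in which the recovered concept is independent of the injected one gives zero bits, which is why a value above zero indicates concept-specific detection rather than generic noise.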

