AI · CL · Mar 5

Dissociating Direct Access from Inference in AI Introspection

arXiv:2603.05414v1 · 1 citation · Has Code
Originality: Incremental advance
AI Analysis

This research gives the AI research community insight into the mechanisms of AI introspection, a foundational cognitive ability.

This paper investigates how AI models introspect, replicating a thought injection detection paradigm. It reveals that models detect injected representations through two mechanisms: probability-matching (inferring from prompt anomaly) and direct access to internal states, with the latter being content-agnostic.

Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating the thought injection detection paradigm of Lindsey et al. (2025) in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from perceived anomaly of the prompt) and (ii) direct access to internal states. The direct access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for them, correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
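
As a rough illustration of what "injecting a representation" can mean, the sketch below adds a scaled concept direction to a hidden-state vector. It is a toy under stated assumptions, not the paper's implementation: the function name `inject_concept`, the three-dimensional vectors, and the `strength` parameter are all hypothetical, and a real experiment would operate on a model's internal activations rather than plain Python lists.

```python
from typing import List


def inject_concept(hidden: List[float], concept: List[float], strength: float) -> List[float]:
    """Return hidden + strength * concept (hypothetical helper, toy scale).

    Assumption: injection is modeled as simple vector addition along a
    concept direction; the source does not specify the exact procedure.
    """
    if len(hidden) != len(concept):
        raise ValueError("hidden and concept must have the same length")
    return [h + strength * c for h, c in zip(hidden, concept)]


# Toy values chosen only for illustration.
hidden = [0.5, -1.0, 2.0]
concept = [1.0, 0.0, -1.0]

injected = inject_concept(hidden, concept, strength=2.0)  # injection trial
control = inject_concept(hidden, concept, strength=0.0)   # no-injection control

print(injected)
print(control)
```

A detection experiment would then compare the model's self-reports on injection trials against control trials; setting `strength` to zero gives the matched control, since it leaves the hidden state unchanged.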

