AI · Jun 4

LLM Self-Recognition: Steering and Retrieving Activation Signatures

arXiv:2606.06315
AI Analysis

For practitioners needing to attribute AI-generated text, this offers a practical alternative to traditional detectors by leveraging internal model representations.

The paper demonstrates that LLMs can reliably self-recognize their outputs, and introduces a steering method using random sparse vectors to create detectable fingerprints, achieving over 98% attribution accuracy without degrading text quality.

Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model's natural representation structure for attribution rather than embedding a signal externally. Our contributions are: (i) establishing reliable self-recognition capabilities in LLMs, (ii) introducing a simple steering mechanism that enables multi-LLM identification with no quality degradation, and (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.
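The steering idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it uses synthetic Gaussian arrays in place of real residual-stream activations, and it scores the steered activations directly rather than recovering the signal through a separate detector LLM, as the paper describes. The function names (`make_sparse_vector`, `steer`, `attribution_scores`) and all sizes and scales are hypothetical choices made for illustration.

```python
import numpy as np

def make_sparse_vector(dim, num_nonzero, scale, rng):
    """Random sparse vector: `num_nonzero` random coordinates set to +/- scale."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=num_nonzero, replace=False)
    v[idx] = scale * rng.choice([-1.0, 1.0], size=num_nonzero)
    return v

def steer(hidden, v):
    """Add the steering vector to every token's activation (broadcast over tokens)."""
    return hidden + v

def attribution_scores(hidden, fingerprints):
    """Project the token-averaged activation onto each unit-normalised fingerprint."""
    unit = fingerprints / np.linalg.norm(fingerprints, axis=1, keepdims=True)
    return hidden.mean(axis=0) @ unit.T

rng = np.random.default_rng(0)
dim, tokens = 512, 64  # assumed toy sizes, not taken from the paper

# One hypothetical fingerprint per candidate model.
fingerprints = np.stack([make_sparse_vector(dim, 16, 1.0, rng) for _ in range(3)])

# Synthetic stand-in for residual-stream activations of one generated text.
hidden = rng.normal(size=(tokens, dim))

# "Model 1" steers its activations with its own fingerprint.
steered = steer(hidden, fingerprints[1])

scores = attribution_scores(steered, fingerprints)
predicted = int(np.argmax(scores))
print("scores:", np.round(scores, 2), "predicted model:", predicted)
```

In this toy setting the fingerprint stands out because averaging over tokens shrinks the random component of the projection while the steering offset stays constant. Whether the same margin survives the round trip through generated text and a detector model is the empirical question the paper addresses.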

