AI · Jun 4

LLM Self-Recognition: Steering and Retrieving Activation Signatures

Thibaud Ardoin, Jonas Schäfer, Gerhard Wunder

arXiv:2606.0631531.1

AI Analysis

For practitioners needing to attribute AI-generated text, this offers a practical alternative to traditional detectors by leveraging internal model representations.

The paper demonstrates that LLMs can reliably self-recognize their outputs, and introduces a steering method using random sparse vectors to create detectable fingerprints, achieving over 98% attribution accuracy without degrading text quality.

Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable them to recognize their own outputs. We demonstrate that this self-recognition is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that allows a given text to be attributed to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of the generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors: it relies on the model's natural representation structure for attribution rather than embedding a signal externally. Our contributions are: (i) establishing that LLMs have reliable self-recognition capabilities, (ii) introducing a simple steering mechanism that enables multi-LLM identification with no quality degradation, and (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.
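The steering idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: random Gaussian lists stand in for residual-stream activations, and the function names (`make_sparse_key`, `steer`, `attribute`, `demo`), the dimensions, the sparsity level, and the steering strength are all hypothetical choices made for illustration. It also skips the step where a detector LLM reads the generated text and simply projects the steered activations onto each candidate key.

```python
import math
import random


def make_sparse_key(dim, n_nonzero, seed):
    """Hypothetical helper: a random sparse unit vector used as a fingerprint."""
    rng = random.Random(seed)
    key = [0.0] * dim
    # Pick n_nonzero distinct coordinates and give each a random sign.
    for i in rng.sample(range(dim), n_nonzero):
        key[i] = rng.choice([-1.0, 1.0])
    norm = math.sqrt(n_nonzero)  # Euclidean norm of n_nonzero entries of +/-1
    return [v / norm for v in key]


def steer(hidden, key, strength):
    """Add the same scaled key to every token's (simulated) hidden state."""
    return [[h + strength * k for h, k in zip(row, key)] for row in hidden]


def attribute(hidden, keys):
    """Score each candidate key by projecting the mean activation onto it."""
    n_tokens = len(hidden)
    mean_act = [sum(col) / n_tokens for col in zip(*hidden)]
    scores = {
        name: sum(m * k for m, k in zip(mean_act, key))
        for name, key in keys.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def demo():
    """Steer simulated activations with one key, then attribute them."""
    dim, n_tokens = 512, 64  # assumed sizes, not taken from the paper
    keys = {
        "model_a": make_sparse_key(dim, 16, seed=1),
        "model_b": make_sparse_key(dim, 16, seed=2),
    }
    rng = random.Random(0)
    # Gaussian noise stands in for real residual-stream activations.
    hidden = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
    steered = steer(hidden, keys["model_a"], strength=4.0)
    return attribute(steered, keys)
```

Calling `demo()` returns the best-matching key name together with the per-key scores. Averaging over tokens shrinks the noise in each projection while the steering offset stays constant, which is why a fixed direction can stand out in this toy setting. Whether the same holds for activations recovered by a separate detector model is what the paper evaluates.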
