The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition
For researchers studying self-recognition and persona in LLMs, this work reveals a fundamental asymmetry where the Assistant persona acts as a privileged reference, extending prior entropy-based signatures to retrospective authorship judgment.
The paper shows that post-trained LLMs (Llama-3.1-70B-Instruct) can recognize their own outputs across personas, but only when the Assistant persona serves as a canonical reference; the entropy gap between Assistant and generator surprise predicts authorship, and this asymmetry is unique to the Assistant, ruling out other personas as substitutes.
Post-trained language models can recognize their own outputs from a sentence or two out of context. In a companion paper \citep{jack2026twomodes} we showed they can also recognize when they are currently acting on-policy, through the sharp entropy drop of assistant-mode generation. Both signals are tied to the Assistant persona that post-training mainly shapes. This paper widens the frame to cross-persona authorship judgement on Llama-3.1-70B-Instruct. We measure a matrix of authorship claim rates over a panel of evaluator and generator personas spanning librarian to dragon to Shakespeare, and make two claims. \emph{First}, on the Assistant's own row of the matrix, the Assistant's claim rate, the persona-vector distance from the Assistant in activation space, and the entropy gap between the Assistant's surprise on a persona's text and the persona's surprise on its own text are all tightly coupled. This extends the entropy signature of \emph{acting} from the companion paper to a retrospective signature of \emph{having acted}. \emph{Second}, this coupling fails off the Assistant's row: the natural symmetric extension of the entropy gap does not predict authorship for distinctive evaluators (pirate, dragon, Shakespeare); what does is asymmetric -- the evaluator's surprise compared to the Assistant's surprise on the same text, not to the generator's. We rule out the alternative that any persona could play this reference role by trying many candidate substitutes; none does. We interpret the asymmetry as the model performing an implicit Bayesian likelihood-ratio test against the Assistant as the canonical alternative hypothesis, with the persona-vector geometry of \citet{chen2025persona} (every persona a delta off the Assistant) ensuring that the Assistant is the only persona universally accessible to that test.