CL · LG · Feb 16

Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

arXiv:2602.14777v1 · 3 citations · h-index: 21
Originality: Incremental advance
AI Analysis

The paper provides a method for querying models about their own safety states, which could help in monitoring AI alignment. The advance is incremental: it builds on prior work on emergent misalignment and behavioral self-awareness.

The study investigated whether language models that are fine-tuned to become misaligned (toxic) and then realigned are aware of these behavioral changes. It found that misaligned models rated themselves as significantly more harmful than the base or realigned models did.

Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
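The comparison described in the abstract (self-rated harmfulness for misaligned models versus base and realigned models) can be illustrated with a small sketch. The abstract does not specify the rating scale or the statistical test, so everything here is an assumption: the 0-100 scale, the `permutation_p_value` helper, and the one-sided permutation test are illustrative choices, and the ratings are invented placeholders, not results from the paper.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """One-sided permutation test: is mean(group_a) greater than mean(group_b)?

    Hypothetical helper for illustration; not the paper's analysis.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign ratings to the two groups and recompute the gap.
        rng.shuffle(pooled)
        gap = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if gap >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Placeholder self-ratings on an assumed 0-100 harmfulness scale.
# These numbers are invented for illustration and are NOT data from the paper.
self_ratings = {
    "base": [5, 8, 3, 10, 6, 4, 7, 9],
    "misaligned": [62, 70, 55, 48, 66, 73, 59, 64],
    "realigned": [9, 12, 6, 11, 4, 8, 10, 7],
}

for reference in ("base", "realigned"):
    gap, p = permutation_p_value(self_ratings["misaligned"], self_ratings[reference])
    print(f"misaligned vs {reference}: mean gap = {gap:.1f}, p = {p:.4f}")
```

A permutation test is used here only because it makes no distributional assumptions about the ratings; any standard two-sample test could stand in for it.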

