The Attentional White Bear Effect in Transformer Language Models
This work identifies a fundamental limitation of instruction-based suppression as an alignment technique for language models: suppression does not eliminate the internal representation of prohibited concepts, which has implications for the safety and control of AI systems.
The study investigates whether instruction-based suppression in transformer language models truly reduces the internal representation of prohibited concepts or merely suppresses their expression. Through representational probing, attention analysis, and behavioral experiments, the authors find that prohibited concepts remain recoverable from hidden representations, influence attention routing, and shape downstream generations despite successful lexical avoidance, exposing a gap between behavioral and representational alignment.
Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.
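To make the idea of representational probing concrete, here is a minimal sketch of a linear probe over mean-pooled hidden states. It is not the authors' pipeline. The hidden states are synthetic Gaussian vectors standing in for real model activations. The concept direction, its strength, and every function name (`make_example`, `train_probe`, `run_demo`) are hypothetical choices made for illustration. The sketch shows only the general recipe: pool per-token states, fit a linear classifier on concept-present versus concept-absent examples, and read held-out accuracy as a measure of how recoverable the concept is.

```python
import math
import random


def mean_pool(token_states):
    """Average a list of per-token vectors into a single vector."""
    dim = len(token_states[0])
    n = len(token_states)
    return [sum(tok[d] for tok in token_states) / n for d in range(dim)]


def make_example(rng, concept_present, direction, n_tokens=8, strength=2.0):
    """Synthetic stand-in for hidden states (assumption, not real activations):
    Gaussian noise plus, when the concept is present, a shift along `direction`."""
    shift = strength if concept_present else 0.0
    return [
        [rng.gauss(0.0, 1.0) + shift * direction[d] for d in range(len(direction))]
        for _ in range(n_tokens)
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, epochs=300, lr=1.0):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose label the linear probe predicts correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


def run_demo(seed=0, dim=16, n_train=200, n_test=100):
    """Train on one synthetic split and report accuracy on a held-out split."""
    rng = random.Random(seed)
    # Hypothetical unit-norm concept direction.
    direction = [1.0 / math.sqrt(dim)] * dim

    def dataset(n):
        labels = [i % 2 for i in range(n)]
        feats = [mean_pool(make_example(rng, bool(y), direction)) for y in labels]
        return feats, labels

    train_x, train_y = dataset(n_train)
    test_x, test_y = dataset(n_test)
    w, b = train_probe(train_x, train_y)
    return probe_accuracy(w, b, test_x, test_y)


if __name__ == "__main__":
    print(f"held-out probe accuracy: {run_demo():.2f}")
```

In a real experiment the synthetic examples would be replaced by hidden states extracted from a model under suppression and control prompts, and the probe would be compared against a baseline (for example, shuffled labels) before any accuracy is interpreted as evidence that the concept is still represented.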