CL | May 7

Negative Before Positive: Asymmetric Valence Processing in Large Language Models

arXiv:2605.05653 | 71.4 | Has Code

AI Analysis

For researchers in mechanistic interpretability and AI safety, this work provides evidence that emotional valence in LLMs is a concrete, manipulable target for oversight. The contribution is incremental: it extends known interpretability methods to a new domain.

The paper investigates whether LLMs process emotional valence through dedicated internal structure or through surface token matching. It finds that negative valence localizes to early layers while positive valence peaks in mid-to-late layers, and that valence is localized, causal, and steerable.

Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive outcomes peak at mid-to-late layers. Holding topic fixed while flipping valence produces sign-opposite responses, ruling out topic detection. Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence, showing these layers encode valence as a manipulable direction. Emotional valence in LLMs is localized, causal and steerable, making it a concrete target for interpretability-based oversight.
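The steering step described in the abstract can be illustrated with a toy sketch. The abstract does not say how the good-news direction is computed, so this example assumes a difference-of-means direction, a common choice in steering work. The activation vectors, the function names, and the scale `alpha` are all hypothetical and do not come from the paper or its code.

```python
# Toy sketch of difference-of-means activation steering (pure Python, no model).
# ASSUMPTION: the direction is the normalized difference between mean
# "good news" and mean "bad news" activations; the paper may do otherwise.
# All vectors below are made-up illustrative numbers, not data from the paper.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steering_direction(positive_acts, negative_acts):
    """Unit vector pointing from the mean negative to the mean positive activation."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = dot(diff, diff) ** 0.5
    if norm == 0:
        raise ValueError("positive and negative means coincide; no direction")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Add alpha times the direction to a hidden state at the chosen layer."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical layer activations for good-news and bad-news prompts.
positive_acts = [[2.0, 0.5, 1.0], [1.8, 0.3, 1.2]]
negative_acts = [[-2.0, 0.4, 1.1], [-1.6, 0.6, 0.9]]
# Hypothetical activation for a neutral prompt at the same layer.
neutral_hidden = [0.0, 0.5, 1.0]

direction = steering_direction(positive_acts, negative_acts)
steered = steer(neutral_hidden, direction, alpha=3.0)

# Projection onto the valence direction before and after steering.
before = dot(neutral_hidden, direction)
after = dot(steered, direction)
```

Because the direction is normalized to unit length, the projection of the steered state onto it grows by exactly `alpha`. In a real model, the same addition would be applied to the residual stream at the identified layers, and the effect would be read off from the generated text or output logits instead of a projection.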
