CL · AI · Feb 8, 2025

Mechanistic Interpretability of Emotion Inference in Large Language Models

arXiv:2502.05489v2 · 23 citations · h-index: 12 · ACL
Originality: Highly original
AI Analysis

This work examines how LLMs infer emotions and how that inference can be controlled, a question relevant to researchers and practitioners in affective computing and AI safety.

The study investigated how autoregressive large language models (LLMs) process emotional stimuli, finding that emotion representations are functionally localized to specific model regions and are psychologically plausible according to cognitive appraisal theory. By causally intervening on appraisal concepts, the researchers demonstrated the ability to steer emotional text generation in alignment with theoretical expectations.

Large language models (LLMs) show promising capabilities in predicting human emotions from text. However, the mechanisms through which these models process emotional stimuli remain largely unexplored. Our study addresses this gap by investigating how autoregressive LLMs infer emotions, showing that emotion representations are functionally localized to specific regions in the model. Our evaluation includes diverse model families and sizes and is supported by robustness checks. We then show that the identified representations are psychologically plausible by drawing on cognitive appraisal theory, a well-established psychological framework positing that emotions emerge from evaluations (appraisals) of environmental stimuli. By causally intervening on construed appraisal concepts, we steer the generation and show that the outputs align with theoretical and intuitive expectations. This work highlights a novel way to causally intervene and precisely shape emotional text generation, potentially benefiting safety and alignment in sensitive affective domains.
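The abstract does not say how the intervention is implemented, so the following is only a minimal sketch of one common steering approach, not the authors' method. It assumes a difference-of-means direction for a single appraisal concept, added to a hidden-state vector with a scaling factor. The names `concept_direction`, `steer`, `projection`, and `alpha`, along with the toy vectors, are hypothetical and stand in for real model activations.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_direction(high: List[Vector], low: List[Vector]) -> Vector:
    """Unit-norm difference between the mean activation of examples rated
    high on an appraisal concept and those rated low (an assumed recipe)."""
    diff = [h - l for h, l in zip(mean(high), mean(low))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the concept direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: Vector, direction: Vector) -> float:
    """Scalar projection of a hidden state onto the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy 3-dimensional "activations" (hypothetical, not taken from any model).
high_concept = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
low_concept = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]

direction = concept_direction(high_concept, low_concept)
hidden = [0.5, 1.0, -1.0]
steered = steer(hidden, direction, alpha=2.0)

print(direction)  # [1.0, 0.0, 0.0]
print(steered)    # [2.5, 1.0, -1.0]
```

In a real model the shift would be applied to the residual stream at a chosen layer during generation. Because the direction has unit norm, the projection onto it increases by exactly `alpha`, which is a quick check that the intervention moved the representation as intended.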
