CL · AI · Dec 7, 2025

Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis

arXiv:2512.06681v1 · h-index: 6
Originality: Incremental advance
AI Analysis

This paper provides causal evidence about how a large language model computes sentiment, a foundational question in mechanistic interpretability, though the contribution is incremental: it refines existing hypotheses.

The study examines how GPT-2 processes sentiment information across its layers. It finds that early layers detect lexical sentiment independently of context, while contextual integration occurs primarily in late layers through a unified mechanism, falsifying the hypothesized mid-layer specialization.

We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position-specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses (Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing) are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, and domain shifts are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These experimental findings provide causal evidence that GPT-2's sentiment computation differs from the predicted hierarchical pattern, highlighting the need for further empirical characterization of contextual integration in large language models.
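
The layer-by-layer activation patching described in the abstract can be illustrated with a small sketch. This is not the authors' code and does not load GPT-2: it assumes a hypothetical 12-layer toy model with random weights, three token positions, and causal mean-mixing standing in for attention. All names (`layer`, `run`, `patching_sweep`, `CLEAN`, `CORRUPT`) are invented for illustration. The idea it shows is the generic one: cache activations from a clean run, overwrite one (layer, position) activation in a corrupted run, and measure how much of the clean output is recovered.

```python
import math
import random

# Toy sizes; only the layer count mirrors GPT-2 small. Everything else is assumed.
N_LAYERS, N_POS, D = 12, 3, 4

rng = random.Random(0)
# One random weight matrix per layer (hypothetical toy weights, not GPT-2's).
WEIGHTS = [[[rng.gauss(0, 0.3) for _ in range(D)] for _ in range(D)]
           for _ in range(N_LAYERS)]
READOUT = [rng.gauss(0, 1) for _ in range(D)]


def layer(states, w):
    """One toy block: causal mean-mixing, then a residual tanh update."""
    out = []
    for p in range(len(states)):
        # Position p only sees the mean of positions 0..p (causal).
        ctx = [sum(states[q][i] for q in range(p + 1)) / (p + 1) for i in range(D)]
        upd = [math.tanh(sum(w[i][j] * ctx[j] for j in range(D))) for i in range(D)]
        out.append([states[p][i] + upd[i] for i in range(D)])
    return out


def run(inputs, patch=None):
    """Forward pass; `patch` = (layer_idx, pos, vector) overwrites one activation.

    Returns the scalar score read from the final position and a cache of
    the per-layer, per-position states.
    """
    states, cache = [list(v) for v in inputs], []
    for l in range(N_LAYERS):
        states = layer(states, WEIGHTS[l])
        if patch is not None and patch[0] == l:
            states[patch[1]] = list(patch[2])
        cache.append([list(v) for v in states])
    score = sum(READOUT[i] * states[-1][i] for i in range(D))
    return score, cache


def patching_sweep(clean, corrupt, pos):
    """Fraction of the clean-vs-corrupt score gap recovered per patched layer."""
    clean_score, clean_cache = run(clean)
    corrupt_score, _ = run(corrupt)
    gap = clean_score - corrupt_score
    recovered = []
    for l in range(N_LAYERS):
        patched_score, _ = run(corrupt, patch=(l, pos, clean_cache[l][pos]))
        recovered.append((patched_score - corrupt_score) / gap)
    return recovered


# Inputs differ only at position 1, standing in for a flipped sentiment word.
CLEAN = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
CORRUPT = [[1.0, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

if __name__ == "__main__":
    for l, r in enumerate(patching_sweep(CLEAN, CORRUPT, pos=1)):
        print(f"layer {l:2d}: recovery {r:+.3f}")
```

In this toy setting, patching the flipped position at the last layer recovers nothing, because the score is read from the final position and no later layer remains to move the information there. Patching the final position at the last layer recovers the full gap. A per-layer recovery profile of this kind, computed on the real model, is the sort of evidence the abstract appeals to when it locates lexical detection in early layers and contextual integration in late ones.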

