Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect
For researchers in mechanistic interpretability and affective computing, this work provides a cross-architectural analysis of how LLMs represent literary and emotional concepts, though the findings are incremental given existing SAE-based feature discovery methods.
The paper identifies compositional literary primitives (naming-gates, self-features, stylistic modulators, compositional emotions) in instruction-tuned LLMs via sparse autoencoders, achieving full 27/27 emotion coverage in Llama 3.1 8B-Instruct and 23/27 in Gemma 2 9B-IT under a forced-choice 5-LLM judge panel, with per-cell false-positive probability ~10^{-3}.
We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of $10^{-3}$ and the expected number of two-seed false-positive cells across the catalog is negligible, so the observed coverage is not consistent with chance. A cross-architectural asymmetry sits in the strict-versus-soft judge contrast: on the same generations, judges agree more often on Llama outputs than on Gemma outputs because Llama outputs name the target affect more directly while Gemma outputs evoke it through scene and imagery. Both architectures contain self-features that serve simultaneously as register markers and as emotion emitters, including a single most-RLHF-loaded self-feature per architecture that intensifies the institutional Helper-AI persona at one operating regime and produces affect-categorizable output at the same calibrated coefficient. Methodologically, the paper presents a three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) with documented anti-patterns; the total compute is single-GPU and about 15 minutes per emotion-feature discovery cycle.