LG CL · May 3

Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs

arXiv:2605.08149

AI Analysis

For researchers studying interpretability and uncertainty in LLMs, this work provides a novel mechanistic link between SAE feature interactions and model uncertainty, though the rivalry-based score is a weaker predictor of answer correctness than standard softmax confidence.

This paper introduces Feature Rivalry in SAE representations as a mechanistic signature of model uncertainty in LLMs. It shows that high-entropy questions produce significantly stronger rivalry at layers 0 and 12, and that a rivalry-based score predicts answer correctness with AUROC=0.689, which falls short of softmax confidence (AUROC=0.808).

Sparse Autoencoders (SAEs) decompose large language model representations into interpretable features, but how these features interact under uncertainty remains poorly understood. We introduce Feature Rivalry -- negatively correlated SAE feature pairs -- and study whether rivalry serves as a mechanistic signature of model uncertainty in Gemma-2-2B using Gemma Scope SAEs. Through a controlled within-domain experiment on PopQA split by response entropy, we find that high-entropy questions produce significantly stronger feature rivalry at layers 0 and 12 relative to low-entropy questions (p=5.3x10^-26 and p=5.8x10^-5 respectively), localizing uncertainty to specific processing stages in the residual stream. We then test whether rivalry is causally upstream of model outputs via activation steering along rivalry axes -- finding that steering along the rivalry direction (vec_A - vec_B) causes more output changes than random directions at low steering multipliers across 15 of 20 rival feature pairs. Finally, a per-prompt rivalry score derived from pairwise cosine similarities of active SAE feature decoder vectors predicts answer correctness (AUROC=0.689), approaching but not matching softmax confidence (AUROC=0.808).
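The per-prompt rivalry score and the steering direction described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify how the pairwise cosine similarities are aggregated, so the aggregation used here (the mean magnitude of the negative cosine similarities among active features) is an assumption, as are the function names `rivalry_score` and `steer`, the activation threshold, and the unit-normalization of the steering direction.

```python
import numpy as np


def rivalry_score(decoder, activations, threshold=0.0):
    """Hypothetical per-prompt rivalry score.

    decoder:     (n_features, d_model) array of SAE decoder vectors.
    activations: (n_features,) array of SAE feature activations for one prompt.
    threshold:   a feature counts as active if its activation exceeds this.
    """
    active = np.flatnonzero(activations > threshold)
    if active.size < 2:
        return 0.0  # no pairs to compare

    # Unit-normalize the decoder vectors of the active features.
    vecs = decoder[active]
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / np.clip(norms, 1e-12, None)

    # Pairwise cosine similarities, keeping each unordered pair once.
    cos = unit @ unit.T
    rows, cols = np.triu_indices(active.size, k=1)
    pair_cos = cos[rows, cols]

    # Assumption: rivalry is the mean magnitude of the negative similarities
    # (pairs with non-negative similarity contribute zero).
    return float(np.mean(np.clip(-pair_cos, 0.0, None)))


def steer(resid, vec_a, vec_b, multiplier):
    """Shift a residual-stream vector along the rivalry axis (vec_a - vec_b).

    Assumption: the direction is unit-normalized before scaling.
    """
    direction = vec_a - vec_b
    direction = direction / np.linalg.norm(direction)
    return resid + multiplier * direction
```

With a toy decoder containing two opposed features and one orthogonal feature, the opposed pair alone drives the score, and steering moves the residual vector toward feature A and away from feature B. Other aggregations (for example the minimum pairwise similarity) would fit the abstract's description equally well.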
