CL · Feb 18, 2025

The Knowledge Microscope: Features as Better Analytical Lenses than Neurons

arXiv:2502.12483v2 · 5 citations · h-index: 28 · ACL
Originality: Incremental advance
AI Analysis

This work addresses interpretability and privacy in language models. It is an incremental improvement that refines existing analytical approaches.

The paper studies how factual knowledge is represented in Language Models and proposes features from Sparse Autoencoders, rather than neurons, as the unit of analysis. It reports that features have a stronger influence on knowledge expression, are more interpretable and more monosemantic, and give better privacy protection, with the proposed FeatureEdit method outperforming neuron-based methods.

Previous studies primarily utilize MLP neurons as units of analysis for understanding the mechanisms of factual knowledge in Language Models (LMs); however, neurons suffer from polysemanticity, leading to limited knowledge expression and poor interpretability. In this paper, we first conduct preliminary experiments to validate that Sparse Autoencoders (SAE) can effectively decompose neurons into features, which serve as alternative analytical units. With this established, our core findings reveal three key advantages of features over neurons: (1) Features exhibit stronger influence on knowledge expression and superior interpretability. (2) Features demonstrate enhanced monosemanticity, showing distinct activation patterns between related and unrelated facts. (3) Features achieve better privacy protection than neurons, demonstrated through our proposed FeatureEdit method, which significantly outperforms existing neuron-based approaches in erasing privacy-sensitive information from LMs. Code and dataset will be available.
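To make the idea of decomposing neuron activations into features concrete, here is a minimal sketch of a sparse autoencoder in plain Python. It follows a commonly used formulation (a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty), which is an assumption and not necessarily the architecture or training setup used in the paper. The class and function names (`SparseAutoencoder`, `sae_loss`, `ablate_feature`), the dimensions, and the random untrained weights are all hypothetical. The `ablate_feature` helper only illustrates the generic idea of intervening on a single feature; it is not the paper's FeatureEdit method.

```python
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class SparseAutoencoder:
    """Toy, untrained SAE mapping d_model neuron activations to d_feat features.

    Assumed formulation (not taken from the paper):
        f     = ReLU(W_enc (x - b_dec) + b_enc)
        x_hat = W_dec f + b_dec
    """

    def __init__(self, d_model, d_feat, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(d_feat)]
        self.b_enc = [0.0] * d_feat
        self.W_dec = [[rng.gauss(0, 0.5) for _ in range(d_feat)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Center the input, project to feature space, and keep only
        # non-negative activations (ReLU).
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = matvec(self.W_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        # Reconstruct neuron activations as a weighted sum of feature directions.
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]


def sae_loss(sae, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    f = sae.encode(x)
    x_hat = sae.decode(f)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(abs(v) for v in f)


def ablate_feature(sae, x, idx):
    """Zero one feature and decode, returning edited neuron activations."""
    f = sae.encode(x)
    f[idx] = 0.0
    return sae.decode(f)


# Hypothetical usage: 4 "neurons" expanded into 16 candidate features.
sae = SparseAutoencoder(d_model=4, d_feat=16)
x = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(x)
edited = ablate_feature(sae, x, idx=0)
```

In a real setting the weights would be trained to minimise `sae_loss` over many MLP activations, so that each feature activates rarely and ideally corresponds to a single concept. The overcomplete feature dimension (more features than neurons) is what allows polysemantic neurons to be split into more monosemantic units.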

