CL · Jun 24, 2025

Measuring and Guiding Monosemanticity

Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting

arXiv:2506.19382v1 · 13.05 citations · h-index: 25

Originality: Incremental advance

AI Analysis

This work addresses the challenge of reliably localizing and manipulating feature representations in LLMs for researchers and practitioners in mechanistic interpretability, though it is incremental as it builds on existing SAE methods.

The paper tackles incomplete feature isolation and unreliable monosemanticity in Sparse Autoencoders (SAEs) for large language models. It introduces a Feature Monosemanticity Score (FMS) and Guided Sparse Autoencoders (G-SAE) to improve interpretability and control. Evaluations show enhanced monosemanticity and more effective steering on tasks such as toxicity detection and writing style identification.

There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric for feature monosemanticity in latent representations. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.
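
To make the idea of conditioning a latent on a labeled concept concrete, here is a minimal sketch of one possible training objective. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the G-SAE loss, so the combination below (reconstruction error, an L1 sparsity penalty, and a binary cross-entropy term on one reserved latent) is assumed, and every name in it (`guided_sae_loss`, `guided_index`, `l1_coef`, `guide_coef`) is hypothetical.

```python
import math


def matvec(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def guided_sae_loss(x, label, W_enc, b_enc, W_dec, b_dec,
                    l1_coef=1e-3, guide_coef=1.0, guided_index=0):
    """Assumed objective: reconstruction + sparsity + supervision on one latent.

    x      : activation vector taken from the language model
    label  : 1 if the labeled concept is present in the input, else 0
    """
    pre = matvec(W_enc, x, b_enc)           # encoder pre-activations
    z = [max(0.0, v) for v in pre]          # ReLU latent code
    x_hat = matvec(W_dec, z, b_dec)         # decoder reconstruction

    recon = sum((a - r) ** 2 for a, r in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(v) for v in z)       # L1 penalty encourages few active latents

    # Assumption: the reserved latent's pre-activation acts as a concept logit.
    p = sigmoid(pre[guided_index])
    eps = 1e-12
    guide = -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))

    total = recon + l1_coef * sparsity + guide_coef * guide
    return total, {"recon": recon, "sparsity": sparsity, "guide": guide}


# Toy example: 2-dimensional input, 3 latents, latent 0 reserved for the concept.
W_enc = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]
x = [1.0, 0.0]

loss_pos, parts_pos = guided_sae_loss(x, 1, W_enc, b_enc, W_dec, b_dec)
loss_neg, parts_neg = guided_sae_loss(x, 0, W_enc, b_enc, W_dec, b_dec)
```

In this toy setup the reserved latent fires strongly on `x`, so the loss is lower when the label says the concept is present than when it says the concept is absent. Under this assumed objective, that is the pressure that would push the concept into a known latent during training.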
