LG · AI · CL · May 12, 2025

Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

arXiv:2505.08080v2 · 3 citations · h-index: 17 · EMNLP
AI Analysis

This work addresses the challenge of causal interpretation in model steering for AI researchers, but it appears incremental as it builds on existing sparse autoencoder methods.

The paper tackles the problem of identifying influential latent features in sparse autoencoders used to interpret large language models. It proposes Gradient Sparse Autoencoder (GradSAE), which incorporates gradient information, and validates that only high-influence latents are effective for steering.

Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence between each latent feature and the model's output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
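
To make the contrast between input-side activations and output-side influence concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not give a formula, so the activation-times-gradient score, the toy weights, the linear readout standing in for the model's output, and the helper names `influence_scores` and `top_k` are all assumptions chosen for illustration.

```python
# Toy illustration only. The weights, shapes, linear readout, and the
# activation-times-gradient score are assumptions, not the paper's method.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def influence_scores(x, w_enc, b_enc, w_dec, w_out):
    """Return (activations, gradients, scores), one entry per SAE latent.

    z     = relu(W_enc x + b_enc)   input-side activation
    x_hat = W_dec z                 SAE reconstruction
    y     = w_out . x_hat           scalar stand-in for a target output logit
    dy/dz = W_dec^T w_out           output-side gradient (exact: y is linear in z)
    score = z * dy/dz               hypothetical influence score
    """
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    z = relu(pre)
    grad = matvec(transpose(w_dec), w_out)
    scores = [zi * gi for zi, gi in zip(z, grad)]
    return z, grad, scores

def top_k(values, k):
    # Indices of the k entries with the largest magnitude.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return order[:k]

# Hypothetical 2-dimensional hidden state and a 3-latent SAE.
x = [2.0, 1.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 latents x 2 dims
b_enc = [0.0, 0.0, -5.0]                        # latent 2 stays inactive
w_dec = [[1.0, 0.0, 1.0], [0.0, 3.0, 1.0]]      # 2 dims x 3 latents
w_out = [0.1, 1.0]                              # readout to the scalar output

z, grad, scores = influence_scores(x, w_enc, b_enc, w_dec, w_out)
most_active = top_k(z, 1)          # ranking by activation alone
most_influential = top_k(scores, 1)  # ranking by activation times gradient
print(most_active, most_influential)
```

In this toy setup latent 0 has the largest activation, but latent 1 has the largest score because the output is far more sensitive to it. That is the kind of mismatch the first hypothesis describes. In a real LLM the gradient would come from backpropagation through the remaining layers rather than from a closed-form expression.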

