LG · AI · CL · ML · Oct 4, 2025

Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

arXiv:2510.03659v1 · 4 citations · h-index: 4
Originality: Highly original
AI Analysis

This paper tests a fundamental assumption behind SAE-based steering of LLMs and shows that interpretability alone is an insufficient proxy for steering utility, a result that matters to researchers and practitioners developing steering methods.

The paper investigates whether higher interpretability in Sparse Autoencoders (SAEs) leads to better steering utility for large language models (LLMs). It finds only a weak positive rank correlation (Kendall's τ_b ≈ 0.298) and proposes a new feature selection criterion, Delta Token Confidence, that improves steering performance by 52.52% relative to the current best output-score-based criterion.
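The rank-agreement measure mentioned above can be illustrated with a minimal sketch of Kendall's τ_b in plain Python. This is not the paper's evaluation code: the function name `kendall_tau_b` and the example scores are hypothetical, and the sketch only shows how concordant, discordant, and tied pairs combine into the coefficient.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-corrected)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both scores order the pair the same way
        else:
            discordant += 1  # the two scores order the pair oppositely
        # (pairs tied in only one variable shrink the denominator below)
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical per-SAE scores, for illustration only (not data from the paper).
interpretability = [0.61, 0.72, 0.55, 0.80, 0.68]
steering_utility = [0.30, 0.28, 0.25, 0.41, 0.35]
print(kendall_tau_b(interpretability, steering_utility))
```

A value near 1 means the two rankings agree on almost every pair, a value near 0 means the orderings are close to unrelated, and a negative value means they tend to disagree.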

Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective steering of model behavior. Yet a fundamental question remains unanswered: does higher interpretability indeed imply better steering utility? To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels. We evaluate their interpretability with SAEBench (arXiv:2501.12345) and their steering utility with AxBench (arXiv:2502.23456), and perform a rank-agreement analysis using Kendall's rank coefficient (τ_b). Our analysis reveals only a relatively weak positive association (τ_b ≈ 0.298), indicating that interpretability is an insufficient proxy for steering performance. We conjecture that the interpretability-utility gap may stem from the selection of SAE features, as not all of them are equally effective for steering. To find features that truly steer the behavior of LLMs, we propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next-token distribution. We show that our method improves the steering performance of three LLMs by 52.52% compared to the current best output-score-based criterion (arXiv:2503.34567). Strikingly, after selecting features with high Delta Token Confidence, the correlation between interpretability and utility vanishes (τ_b ≈ 0) and can even become negative. This further highlights the divergence between interpretability and utility for the most effective steering features.
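The abstract describes Delta Token Confidence only as a measure of how much amplifying a feature changes the next-token distribution and does not give a formula. The sketch below is therefore an assumption, not the authors' definition: it takes "token confidence" to be the probability of the most likely next token and scores a feature by how much that probability shifts when the feature is amplified. All names (`token_confidence`, `delta_token_confidence`, `rank_features`) and the toy logits are hypothetical.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidence(logits):
    """ASSUMPTION: confidence = probability of the most likely next token."""
    return max(softmax(logits))


def delta_token_confidence(base_logits, steered_logits):
    """ASSUMPTION: score = confidence after amplification minus confidence before."""
    return token_confidence(steered_logits) - token_confidence(base_logits)


def rank_features(base_logits, steered_by_feature, top_k):
    """Return the top_k feature ids with the largest confidence shift."""
    scores = {
        feature: delta_token_confidence(base_logits, steered)
        for feature, steered in steered_by_feature.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy next-token logits over a 3-token vocabulary (illustrative values only).
base = [1.0, 1.0, 1.0]
steered = {
    "feature_a": [5.0, 1.0, 1.0],  # amplification sharpens the distribution
    "feature_b": [1.0, 1.0, 1.0],  # amplification leaves it unchanged
}
print(rank_features(base, steered, top_k=1))
```

In a real pipeline the steered logits would come from a forward pass with the SAE feature's activation scaled up. Other ways of measuring the distribution shift, such as a divergence between the two distributions, would fit the abstract's description equally well.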
