LG AI · May 17, 2025

SplInterp: Improving our Understanding and Training of Sparse Autoencoders

Jeremy Budd, Javier Ideami, Benjamin Macdowall Rynne, Keith Duggar, Randall Balestriero

arXiv:2505.11836v14.1 · h-index: 2 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses theoretical gaps in SAEs for mechanistic interpretability in large language models, offering incremental improvements in training methods.

The paper tackles the limited theoretical understanding of sparse autoencoders (SAEs) by analyzing them through spline theory. The analysis shows that SAEs generalize k-means autoencoders but sacrifice accuracy for interpretability compared to an optimal piecewise affine autoencoder. The paper also introduces a proximal alternating method SGD (PAM-SGD) training algorithm that shows improved sample efficiency in MNIST and LLM experiments and, in the LLM setting, sparser codes.

Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework, we discover that SAEs generalise "$k$-means autoencoders" to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal "$k$-means-esque plus local principal component analysis (PCA)" piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. Finally, we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: https://github.com/splInterp2025/splInterp
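
To make the "piecewise affine" claim concrete, here is a minimal NumPy sketch of a TopK SAE forward pass. It is not taken from the paper or its repository. The function names (`topk_sae`, `local_affine_map`), the parameter names, and the specific formulation (a linear encoder, top-k selection on the pre-activations with no extra ReLU, and a linear decoder) are assumptions chosen for illustration. The sketch shows that once the set of active units is fixed, the reconstruction is an affine function of the input.

```python
import numpy as np

def topk_sae(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of an illustrative TopK SAE on a single input vector x."""
    pre = W_enc @ x + b_enc                   # pre-activations, shape (m,)
    active = np.sort(np.argsort(pre)[-k:])    # indices of the k largest pre-activations
    z = np.zeros_like(pre)
    z[active] = pre[active]                   # keep the top-k entries, zero the rest
    x_hat = W_dec @ z + b_dec                 # linear decoder
    return x_hat, z, active

def local_affine_map(active, W_enc, b_enc, W_dec, b_dec):
    """Affine map x -> A @ x + c that the SAE applies wherever `active` is selected."""
    A = W_dec[:, active] @ W_enc[active, :]
    c = W_dec[:, active] @ b_enc[active] + b_dec
    return A, c

# Small random example (all sizes are arbitrary, hypothetical choices).
rng = np.random.default_rng(0)
d, m, k = 8, 16, 3                            # input dim, dictionary size, sparsity
W_enc, b_enc = rng.normal(size=(m, d)), rng.normal(size=m)
W_dec, b_dec = rng.normal(size=(d, m)), rng.normal(size=d)
x = rng.normal(size=d)

x_hat, z, active = topk_sae(x, W_enc, b_enc, W_dec, b_dec, k)
A, c = local_affine_map(active, W_enc, b_enc, W_dec, b_dec)
print("active units:", active)
print("matches local affine map:", np.allclose(x_hat, A @ x + c))
```

Each possible active set corresponds to one region of input space with its own pair `(A, c)`, so the autoencoder as a whole is piecewise affine. Describing the shape of those regions is what the abstract's power-diagram characterisation addresses.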

