Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
This paper addresses a theoretical gap in sparse autoencoders for interpretability research and offers a more automated approach to hyperparameter tuning, though the contribution appears incremental because it builds on existing SAE frameworks.
The paper tackles the problem of selecting the hyperparameter k in sparse autoencoders (SAEs) used for mechanistic interpretability of large language models. It develops a theoretical link showing that the ℓ₂-norm of the sparse feature vector can be approximated by the ℓ₂-norm of the dense vector with a closed-form error, which enables SAE training without manual selection of k. Building on this link, it introduces a new evaluation method and an activation function (top-AFA) that dynamically determines the number of activated features per input, and it reports competitive results on GPT2 hidden embeddings for over 80 million tokens.
Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter $k$ that represents the number of nonzero activations, often denoted by $\ell_0$. In this paper, we establish a theoretical link: the $\ell_2$-norm of the sparse feature vector can be approximated by the $\ell_2$-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine $\ell_0$. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that can assess the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, an aspect that has been overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). This function enables top-$k$ style activation without requiring a constant hyperparameter $k$ to be tuned, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art $k$-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
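To make the idea of a norm-matched, input-dependent sparsity level concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the target for the sparse feature vector is simply the $\ell_2$-norm of the dense input embedding, and it ignores the closed-form error term mentioned in the abstract. The function names `l2_norm`, `norm_gap`, and `top_afa_sketch` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in values))


def norm_gap(features, embedding):
    """Difference between the sparse-feature norm and the dense-embedding norm.

    Assumption: a gap near zero is taken as agreement with the norm
    approximation; the paper's closed-form error term is not modeled here.
    """
    return l2_norm(features) - l2_norm(embedding)


def top_afa_sketch(pre_activations, embedding):
    """Keep the largest positive activations until their squared norm
    reaches the squared norm of the dense embedding (assumed target).

    The number of kept features depends on the input, so no fixed k is needed.
    """
    target_sq = sum(v * v for v in embedding)
    # Indices sorted by activation value, largest first.
    order = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )
    sparse = [0.0] * len(pre_activations)
    running_sq = 0.0
    for i in order:
        value = pre_activations[i]
        # Stop at non-positive activations or once the target norm is reached.
        if value <= 0.0 or running_sq >= target_sq:
            break
        sparse[i] = value
        running_sq += value * value
    return sparse


pre = [0.1, 3.0, -1.0, 2.0, 0.5]
large_input = top_afa_sketch(pre, [3.0, 2.0])  # keeps two features
small_input = top_afa_sketch(pre, [1.0])       # keeps one feature
```

In this sketch the same pre-activations yield a different number of active features depending on the norm of the embedding, which is the behavior the abstract attributes to top-AFA. A real SAE would compute the pre-activations from an encoder and operate on batched tensors.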