LG | CL | Oct 27, 2024

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

arXiv:2410.20526v1 | 121 citations | h-index: 57 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work advances the open-source SAE ecosystem for mechanistic interpretability research by providing scalable tools and checkpoints, though it is incremental in applying existing methods to a new model.

The paper tackles the challenge of scalable training for Sparse Autoencoders (SAEs) by introducing a suite of 256 SAEs, with 32K and 128K features, trained on the Llama-3.1-8B-Base model. It also evaluates modifications to Top-K SAEs and analyzes the geometry of the learned latents, confirming that feature splitting enables the discovery of new features.

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at https://github.com/OpenMOSS/Language-Model-SAEs. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
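For readers unfamiliar with the Top-K SAE variant the abstract refers to, the following is a minimal sketch of a generic Top-K forward pass in plain Python. It is not the paper's modified variant and does not use the Llama Scope codebase. The function name `topk_sae_forward`, the toy weights, and the ReLU on the surviving latents are illustrative assumptions.

```python
def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Toy Top-K SAE forward pass on plain Python lists (hypothetical helper).

    x:     input activation, length d_model
    W_enc: encoder weights, n_features rows of length d_model
    W_dec: decoder weights, n_features rows of length d_model
    """
    # Encoder pre-activations: one scalar per dictionary feature.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    # Keep the indices of the k largest pre-activations; all others stay zero.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    latents = [0.0] * len(pre)
    for i in keep:
        # ReLU on the survivors is an assumption of this sketch.
        latents[i] = max(pre[i], 0.0)
    # Decoder: bias plus the weighted sum of the active features' directions.
    recon = list(b_dec)
    for i in keep:
        for j, w in enumerate(W_dec[i]):
            recon[j] += latents[i] * w
    return latents, recon


# Toy example: d_model = 2, n_features = 3, k = 2 (all values made up).
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]
latents, recon = topk_sae_forward([1.0, 2.0], W_enc, b_enc, W_dec, b_dec, k=2)
print(latents, recon)
```

In this sketch sparsity is enforced directly by `k`, not by a penalty term in the loss, so at most `k` latents are nonzero for any input.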

Code Implementations: 1 repo
