LG CL | Oct 27, 2024

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu

arXiv:2410.20526v138.3127 citations | h-index: 57 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work advances the open-source SAE ecosystem for mechanistic interpretability research by providing scalable tools and checkpoints, though it is incremental in applying existing methods to a new model.

The paper tackles the challenge of scalable training for Sparse Autoencoders (SAEs) by introducing a suite of 256 SAEs, with 32K and 128K features, trained on the Llama-3.1-8B-Base model. It evaluates modifications to Top-K SAEs and analyzes the geometry of the learned latents, confirming that feature splitting enables the discovery of new features.

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at https://github.com/OpenMOSS/Language-Model-SAEs. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
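
For readers unfamiliar with the Top-K SAE variant named in the abstract, the following is a minimal, dependency-free sketch of the generic idea: encode an activation vector, keep only the k largest pre-activations, and decode from those. It is not the paper's modified variant and does not use the released checkpoints or the OpenMOSS tooling. The function names, the toy weights, and the ReLU applied to the surviving latents are all assumptions made for illustration.

```python
import random


def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Generic Top-K SAE forward pass (illustrative sketch only).

    x      : list of d_model floats (one model activation)
    w_enc  : n_features rows, each of d_model floats
    w_dec  : n_features rows, each of d_model floats (one direction per feature)
    Returns (latents, x_hat).
    """
    # Pre-activations: one scalar per dictionary feature.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(w_enc, b_enc)]
    # Keep the indices of the k largest pre-activations; zero the rest.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    latents = [0.0] * len(pre)
    for i in keep:
        latents[i] = max(pre[i], 0.0)  # ReLU on survivors (assumed here)
    # Reconstruction: decoder bias plus weighted sum of active directions.
    x_hat = list(b_dec)
    for i in keep:
        for j in range(len(x_hat)):
            x_hat[j] += latents[i] * w_dec[i][j]
    return latents, x_hat


def make_toy_sae(d_model, n_features, seed=0):
    """Random toy weights; a hypothetical stand-in for trained parameters."""
    rng = random.Random(seed)
    w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
    w_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
    return w_enc, [0.0] * n_features, w_dec, [0.0] * d_model


# Toy usage: 8-dimensional activation, 32 features, at most 4 active latents.
w_enc, b_enc, w_dec, b_dec = make_toy_sae(d_model=8, n_features=32)
latents, x_hat = topk_sae_forward([0.5] * 8, w_enc, b_enc, w_dec, b_dec, k=4)
```

The sparsity here comes from the hard top-k selection rather than from a penalty term, so the number of active latents per input is bounded by k by construction.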
