CL | AI | Oct 28, 2024

Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

arXiv:2410.21508v2 | 10 citations | h-index: 2 | EMNLP
Originality: Incremental advance
AI Analysis

Group-SAE addresses the cost of scaling SAE training for researchers analyzing LLM representations, though it is an incremental improvement on existing SAE training methods.

The paper tackles the computational inefficiency of training separate sparse autoencoders (SAEs) for each layer in large language models by proposing Group-SAE, which groups similar layers to train one SAE per group, reducing training time by 40-60% while maintaining comparable reconstruction quality and interpretability.

Sparse autoencoders (SAEs) have recently been employed as a promising unsupervised approach for understanding the layer representations of Large Language Models (LLMs). However, as models grow in size and complexity, training SAEs becomes computationally intensive, since typically one SAE is trained for each model layer. To address this limitation, we propose Group-SAE, a novel strategy for training SAEs. Our method uses the similarity of the residual stream representations between contiguous layers to group similar layers and train a single SAE per group. To balance the trade-off between efficiency and performance, we further introduce AMAD (Average Maximum Angular Distance), an empirical metric that guides the selection of an optimal number of groups based on representational similarity across layers. Experiments on models from the Pythia family show that our approach significantly accelerates training, with minimal impact on reconstruction quality and with downstream task performance and interpretability comparable to baseline SAEs trained layer by layer. This method provides an efficient and scalable strategy for training SAEs in modern LLMs.
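To make the grouping idea concrete, the sketch below shows one way contiguous layers could be grouped by the angular distance between their residual stream activations. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the clustering algorithm or the exact AMAD formula, so the greedy complete-linkage merge, the function names, and the reading of "average maximum angular distance" as the mean over groups of the largest within-group distance are all hypothetical. The activations are synthetic.

```python
import numpy as np


def mean_angular_distance(acts):
    """acts has shape (n_layers, n_tokens, d_model).

    Returns an (n_layers, n_layers) matrix of angular distances,
    scaled to [0, 1] and averaged over tokens.
    """
    unit = acts / np.linalg.norm(acts, axis=-1, keepdims=True)
    # Per-token cosine similarity between every pair of layers.
    cos = np.einsum("ind,jnd->ijn", unit, unit)
    ang = np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
    return ang.mean(axis=-1)


def group_contiguous_layers(dist, n_groups):
    """Assumed grouping rule: greedily merge the pair of adjacent groups
    with the smallest complete-linkage (maximum pairwise) distance."""
    groups = [[i] for i in range(dist.shape[0])]
    while len(groups) > n_groups:
        costs = [
            dist[np.ix_(groups[g], groups[g + 1])].max()
            for g in range(len(groups) - 1)
        ]
        g = int(np.argmin(costs))
        groups[g:g + 2] = [groups[g] + groups[g + 1]]
    return groups


def average_max_distance(dist, groups):
    """Assumed AMAD-like score: mean over groups of the largest
    within-group angular distance (not taken from the paper)."""
    return float(np.mean([dist[np.ix_(g, g)].max() for g in groups]))


# Synthetic stand-in for residual stream activations: six layers forming
# two blocks of three near-identical layers each.
rng = np.random.default_rng(0)
n_tokens, d_model = 200, 16
block_a = rng.normal(size=(n_tokens, d_model))
block_b = rng.normal(size=(n_tokens, d_model))
acts = np.stack(
    [block_a + 0.05 * rng.normal(size=block_a.shape) for _ in range(3)]
    + [block_b + 0.05 * rng.normal(size=block_b.shape) for _ in range(3)]
)

dist = mean_angular_distance(acts)
groups = group_contiguous_layers(dist, n_groups=2)
score_two_groups = average_max_distance(dist, groups)
score_one_group = average_max_distance(dist, [list(range(6))])
print(groups, score_two_groups, score_one_group)
```

In this toy setup the score stays low as long as each group contains only similar layers, and rises once dissimilar layers are forced into the same group. That is the kind of signal one could use to choose the number of groups, with one SAE then trained per group on activations drawn from its member layers.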
