Empirical Evaluation of Progressive Coding for Sparse Autoencoders
This work addresses a computational bottleneck for researchers and practitioners who use SAEs in applications such as representation engineering and information retrieval. Its contribution is incremental, however, since it builds on existing SAE and Matryoshka methods.
The paper tackles the computational expense of training multiple sparse autoencoders (SAEs) of different sizes. It compares two approaches on a language modeling task: progressive coding via subset pruning of a vanilla SAE, and joint training of nested (Matryoshka) SAEs. Matryoshka SAEs achieve lower reconstruction and language modeling losses and higher representational similarity, while pruned vanilla SAEs are more interpretable.
Sparse autoencoders (SAEs) \citep{bricken2023monosemanticity,gao2024scalingevaluatingsparseautoencoders} rely on dictionary learning to extract interpretable features from neural networks at scale in an unsupervised manner, with applications to representation engineering and information retrieval. SAEs are, however, computationally expensive \citep{lieberum2024gemmascopeopensparse}, especially when multiple SAEs of different sizes are needed. We show that dictionary importance in vanilla SAEs follows a power law. On a language modeling task, we compare progressive coding based on subset pruning of SAEs with jointly training nested SAEs, so-called {\em Matryoshka} SAEs \citep{bussmann2024learning,nabeshima2024Matryoshka}. We show that Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss, as well as higher representational similarity. Pruned vanilla SAEs, however, are more interpretable. We discuss the origins and implications of this trade-off.
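To make the two approaches concrete, the following is a minimal NumPy sketch and not the implementation evaluated in this work. It assumes a ReLU SAE, a hypothetical importance score (mean squared activation per latent), and a Matryoshka objective written as a sum of reconstruction errors over nested prefixes of the dictionary. All function names and parameter shapes are illustrative assumptions.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, keep=None):
    """ReLU SAE reconstruction; `keep` optionally restricts decoding to a subset of latents."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)      # (batch, n_latents)
    if keep is not None:
        mask = np.zeros(z.shape[1])
        mask[keep] = 1.0                        # zero out every latent not in `keep`
        z = z * mask
    return z @ W_dec + b_dec, z

def importance_order(z):
    """Assumed importance score: mean squared activation, sorted from most to least important."""
    return np.argsort(-(z ** 2).mean(axis=0))

def pruned_mse(x, params, order, g):
    """Reconstruction MSE when only the first g latents in `order` are kept (subset pruning)."""
    x_hat, _ = sae_forward(x, *params, keep=order[:g])
    return float(((x - x_hat) ** 2).mean())

def matryoshka_loss(x, params, sizes):
    """Assumed nested objective: sum of MSEs over prefixes of the latents in index order."""
    n_latents = params[0].shape[1]
    prefix = np.arange(n_latents)
    return sum(pruned_mse(x, params, prefix, g) for g in sizes)
```

In this sketch, subset pruning ranks the latents of an already trained SAE and evaluates `pruned_mse` at several granularities `g`, whereas the nested objective would be minimized during training so that each prefix of the dictionary is itself a usable SAE.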