On the transferability of Sparse Autoencoders for interpreting compressed models
The paper addresses both the efficiency and the interpretability challenges of compressed LLMs, but the contribution is incremental, since it builds on existing SAE methods.
The paper asks how model compression affects interpretability, using Sparse Autoencoders (SAEs) as the interpretation tool. It reports two findings. First, SAEs trained on the original model can interpret the compressed model with only slight performance degradation. Second, pruning the original SAE performs comparably to training a new SAE on the pruned model, which reduces SAE training costs.
Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains poorly understood. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model's activation space into its feature basis. In this work, we explore the differences between SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model, albeit with slight performance degradation compared to an SAE trained on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive training costs of SAEs.
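
To make the two findings concrete, the following is a minimal sketch of how an SAE could be transferred and pruned. It is not the paper's implementation. The ReLU SAE form, unstructured magnitude pruning, the normalized-MSE metric, and all function and variable names (`sae_forward`, `magnitude_prune`, `normalized_mse`) are assumptions made for illustration; the abstract does not state which pruning method or evaluation metric the authors use.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Assumed ReLU SAE: returns (sparse features, reconstruction)."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    return features, features @ W_dec + b_dec

def magnitude_prune(W, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest |value|.

    Assumption: unstructured magnitude pruning stands in for whatever
    pruning method the paper actually applies to the SAE.
    """
    k = int(round(sparsity * W.size))
    if k == 0:
        return W.copy()
    # Magnitude of the k-th smallest entry; everything at or below it is zeroed.
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

def normalized_mse(x, x_hat):
    """Reconstruction error relative to the variance of the activations."""
    return float(np.mean((x - x_hat) ** 2) / np.var(x))

# Placeholder data: random weights and activations, NOT real model outputs.
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc, b_dec = np.zeros(d_sae), np.zeros(d_model)
acts_compressed = rng.normal(size=(128, d_model))  # stand-in for compressed-model activations

# Finding 1: apply the original SAE directly to compressed-model activations.
_, recon = sae_forward(acts_compressed, W_enc, b_enc, W_dec, b_dec)
err_transferred = normalized_mse(acts_compressed, recon)

# Finding 2: prune the original SAE instead of training a new one.
W_enc_p, W_dec_p = magnitude_prune(W_enc, 0.5), magnitude_prune(W_dec, 0.5)
_, recon_p = sae_forward(acts_compressed, W_enc_p, b_enc, W_dec_p, b_dec)
err_pruned = normalized_mse(acts_compressed, recon_p)
```

With real activations, `err_transferred` and `err_pruned` would be compared against the error of an SAE trained from scratch on the compressed model; the random data here only exercises the code path and says nothing about the paper's results.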