AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations
This work addresses the challenge of interpretability in large language models for researchers and practitioners, though it is an incremental improvement over existing sparse autoencoder methods.
The paper tackles the problem of fixed sparsity constraints in sparse autoencoders for LLM interpretability. It proposes AdaptiveK SAE, which dynamically adjusts sparsity based on input complexity, and reports significant improvements in reconstruction fidelity, explained variance, cosine similarity, and interpretability metrics across ten language models.
Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically adjusts sparsity levels based on the semantic complexity of each input. Leveraging linear probes, we demonstrate that context complexity is linearly encoded in LLM representations, and we use this signal to guide feature allocation during training. Experiments across ten language models (from 70M to 14B parameters) demonstrate that this complexity-driven adaptation significantly outperforms fixed-sparsity approaches on reconstruction fidelity, explained variance, cosine similarity and interpretability metrics while eliminating the computational burden of extensive hyperparameter tuning.
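The core idea in the abstract, a linear probe that estimates complexity and a per-input choice of k driven by that estimate, can be illustrated with a minimal sketch. This is not the authors' implementation: the sigmoid squashing of the probe output, the linear mapping from score to k, the bounds `k_min` and `k_max`, and every function name below are assumptions made for illustration only.

```python
import math
from typing import List, Tuple


def probe_complexity(activation: List[float], probe_w: List[float], probe_b: float) -> float:
    """Hypothetical linear probe: squash w.x + b into (0, 1) as a complexity score."""
    z = sum(w * x for w, x in zip(probe_w, activation)) + probe_b
    return 1.0 / (1.0 + math.exp(-z))


def choose_k(score: float, k_min: int, k_max: int) -> int:
    """Assumed mapping: scale a score in [0, 1] linearly onto an integer k in [k_min, k_max]."""
    score = min(max(score, 0.0), 1.0)
    return k_min + round(score * (k_max - k_min))


def top_k_sparsify(pre_acts: List[float], k: int) -> List[float]:
    """Keep the k largest pre-activations (passed through ReLU) and zero the rest."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def adaptive_k_encode(
    activation: List[float],
    enc_w: List[List[float]],
    enc_b: List[float],
    probe_w: List[float],
    probe_b: float,
    k_min: int,
    k_max: int,
) -> Tuple[List[float], int]:
    """Encode one activation vector with an input-dependent sparsity level k."""
    # Encoder pre-activations: one dot product per dictionary feature.
    pre_acts = [
        sum(w * x for w, x in zip(row, activation)) + b
        for row, b in zip(enc_w, enc_b)
    ]
    # The complexity score decides how many features this input may use.
    k = choose_k(probe_complexity(activation, probe_w, probe_b), k_min, k_max)
    return top_k_sparsify(pre_acts, k), k
```

In this sketch, inputs the probe scores as more complex are allowed more active features, while simpler inputs are reconstructed from fewer. A fixed-sparsity top-k SAE corresponds to the special case `k_min == k_max`.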