LG | AI | CL | Oct 9, 2025

Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training

arXiv:2510.08855v1 | 1 citation
Originality: Incremental advance
AI Analysis

This paper addresses a specific bottleneck in interpretability methods for large language models, though it appears incremental relative to existing SAE approaches.

The paper tackles the problem of feature absorption in sparse autoencoder training for LLM interpretability, introducing Adaptive Temporal Masking (ATM) which achieves substantially lower absorption scores while maintaining excellent reconstruction quality on the Gemma-2-2b model.

Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods suffer from feature absorption, where features (or neurons) are absorbed into each other to minimize the $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores than existing methods such as TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
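
The abstract describes ATM only at a high level, so the following is a minimal NumPy sketch of the general idea rather than the authors' implementation. Everything specific here is an assumption made for illustration: the function names (`init_state`, `update_importance`, `temporal_mask`), the exponential moving average used to make scores evolve over time, the product used to combine the three statistics, the mean-plus-`k`-standard-deviations threshold, and the sigmoid that turns the thresholded score into a keep probability.

```python
import numpy as np

def init_state(n_features):
    # Running (EMA) statistics per SAE feature. Hypothetical layout.
    return {key: np.zeros(n_features) for key in ("mag", "freq", "contrib")}

def update_importance(state, acts, w_dec, decay=0.9):
    """Update running statistics from one batch and return importance scores.

    acts:  (batch, n_features) non-negative SAE activations
    w_dec: (n_features, d_model) decoder weights
    """
    magnitude = np.abs(acts).mean(axis=0)        # mean activation magnitude
    frequency = (acts > 0).mean(axis=0)          # fraction of inputs that fire
    # Assumed proxy for reconstruction contribution: magnitude times decoder norm.
    contrib = magnitude * np.linalg.norm(w_dec, axis=1)
    for key, value in (("mag", magnitude), ("freq", frequency), ("contrib", contrib)):
        state[key] = decay * state[key] + (1.0 - decay) * value
    # Assumed combination rule: product of the three running statistics.
    return state["mag"] * state["freq"] * state["contrib"]

def temporal_mask(importance, rng, k=0.0, temperature=1.0):
    """Sample a Bernoulli keep-mask from statistically thresholded scores."""
    scale = importance.std() + 1e-8
    threshold = importance.mean() + k * importance.std()
    z = (importance - threshold) / (temperature * scale)
    keep_prob = 1.0 / (1.0 + np.exp(-z))         # sigmoid of the standardized score
    mask = rng.random(importance.shape) < keep_prob
    return mask, keep_prob

# Toy usage: 16 features, the first 4 of which never activate.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(64, 16)), 0.0)
acts[:, :4] = 0.0
w_dec = rng.normal(size=(16, 8))

state = init_state(16)
for _ in range(10):                              # scores evolve across steps
    importance = update_importance(state, acts, w_dec)

mask, keep_prob = temporal_mask(importance, rng)
masked_acts = acts * mask                        # masked features are zeroed
```

In this toy setup the features that never fire accumulate zero importance and therefore receive the lowest keep probabilities, while sampling the mask (instead of applying a hard cutoff) leaves borderline features a chance to stay active. How the paper actually weights the three signals and sets its threshold is not stated in the abstract.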

