LG AI Apr 7

Improving Robustness In Sparse Autoencoders via Masked Regularization

Vivek Narayanaswamy, Kowshik Thopalli, Bhavya Kailkhura, Wesam Sakla

arXiv:2604.06495 73.4 h-index: 12

AI Analysis

This work addresses robustness issues in sparse autoencoders used for mechanistic interpretability of LLMs. It offers a practical improvement but is incremental, since it builds on existing SAE frameworks.

The paper tackles feature absorption and poor robustness in sparse autoencoders (SAEs) used for mechanistic interpretability. It introduces a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. Across architectures and sparsity levels, the authors report reduced absorption, better probing performance, and a narrower out-of-distribution (OOD) gap.

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often result in brittle latent representations. SAEs are known to be prone to feature absorption, where general features are subsumed by more specific ones due to co-occurrence, degrading interpretability despite high reconstruction fidelity. Recent negative results on Out-of-Distribution (OOD) performance further underscore broader robustness-related failures tied to under-specified training objectives. We address this by proposing a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. This improves robustness across SAE architectures and sparsity levels, reducing absorption, enhancing probing performance, and narrowing the OOD gap. Our results point toward a practical path for more reliable interpretability tools.
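
The abstract describes the core operation only at a high level: tokens are randomly replaced during training to break up co-occurrence patterns. A minimal sketch of that one step is shown below. It is not the authors' implementation. The function name, the `replace_prob` parameter, its default of 0.15, and the choice to sample replacements uniformly from the vocabulary are all assumptions made for illustration. How the corrupted sequences enter the SAE training objective is not specified in the abstract and is not modeled here.

```python
import random


def randomly_replace_tokens(token_ids, vocab_size, replace_prob=0.15, seed=None):
    """Hypothetical helper: corrupt a token sequence by random replacement.

    Each position is, independently and with probability `replace_prob`,
    swapped for a token ID drawn uniformly from range(vocab_size).
    Returns the corrupted sequence and a boolean mask of the positions
    that were selected for replacement.

    Assumption: uniform sampling over the vocabulary. A sampled token can
    coincide with the original one, so a selected position is not
    guaranteed to change.
    """
    rng = random.Random(seed)
    corrupted = []
    selected = []
    for tok in token_ids:
        if rng.random() < replace_prob:
            corrupted.append(rng.randrange(vocab_size))
            selected.append(True)
        else:
            corrupted.append(tok)
            selected.append(False)
    return corrupted, selected


# Example call on a toy sequence with a made-up vocabulary size.
tokens = [17, 4, 256, 9, 9, 31]
corrupted, selected = randomly_replace_tokens(tokens, vocab_size=1000, seed=0)
```

In a training pipeline, the corrupted sequence would presumably be passed through the LLM to obtain activations for the SAE, but that wiring depends on details given in the paper itself.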
