Tokenized SAEs: Disentangling SAE Reconstructions
This work addresses a practical issue for researchers who use SAEs to interpret language models, but the contribution is incremental because it builds on existing SAE methods.
The paper tackles the problem that sparse auto-encoder (SAE) features often correspond to simple input statistics rather than computationally important directions in language models. It proposes a per-token bias that disentangles token reconstruction from feature reconstruction, which the authors report yields significantly more interesting features and improved reconstruction in sparse regimes.
Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work empirically shows that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. To reduce this behavior, we propose a method that disentangles token reconstruction from feature reconstruction. This improvement is achieved by introducing a per-token bias, which provides an enhanced baseline for interesting reconstruction. As a result, significantly more interesting features and improved reconstruction in sparse regimes are learned.
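To make the per-token bias idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract only states that a per-token bias is added, so the class name `TokenizedSAESketch`, the ReLU encoder, the layer shapes, and the choice to add the bias on the decoder side (looked up by the input token id) are all assumptions made for illustration. Training, the sparsity penalty, and any decoder bias are omitted.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, a) for a in vector]


class TokenizedSAESketch:
    """Hypothetical SAE whose reconstruction adds a per-token bias.

    Assumed form: x_hat = W_dec @ relu(W_enc @ x + b_enc) + token_bias[token_id]
    """

    def __init__(self, d_model, d_hidden, vocab_size, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)]
                      for _ in range(d_model)]
        # One bias vector per vocabulary entry, initialised to zero here.
        self.token_bias = [[0.0] * d_model for _ in range(vocab_size)]

    def encode(self, x):
        pre = matvec(self.w_enc, x)
        return relu([p + b for p, b in zip(pre, self.b_enc)])

    def decode(self, features, token_id):
        # Feature reconstruction: the part the sparse features must explain.
        feature_part = matvec(self.w_dec, features)
        # Token reconstruction: a baseline that depends only on the token.
        token_part = self.token_bias[token_id]
        return [f + t for f, t in zip(feature_part, token_part)]

    def forward(self, x, token_id):
        features = self.encode(x)
        return features, self.decode(features, token_id)


# Usage: reconstruct one activation vector for the token with id 3.
sae = TokenizedSAESketch(d_model=4, d_hidden=8, vocab_size=10)
activation = [1.0, -2.0, 0.5, 3.0]
features, reconstruction = sae.forward(activation, token_id=3)
```

In this sketch the token-dependent part of the reconstruction is a lookup rather than something the sparse features must encode, which is one way to read "disentangling token reconstruction from feature reconstruction."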