LG · AI · May 27

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

Prathyush Poduval, Calvin Yeung, Neel Desai, Mohsen Imani

arXiv:2605.27819 · 60.7 · h-index: 10

Predicted impact: top 40% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For researchers using sparse autoencoders to interpret transformer models, ReSAEs provide a simple method to improve multi-layer interventions by removing linearly predictable cross-layer structure.

Residualized Sparse Autoencoders (ReSAEs) reduce decoder redundancy across layers by training later-layer SAEs on residuals after removing linearly predictable cross-layer structure, improving multi-layer intervention fidelity. On Pythia-1.4B and Gemma-2-9B, ReSAEs recover more transformer cross-entropy under multi-layer replacement despite reconstructing less raw variance.

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Despite reconstructing less of the raw activation variance, ReSAEs recover more transformer cross-entropy under multi-layer replacement. This gain is clearest under teacher forcing and, in the online setting, at sufficient sparsity, indicating that ReSAEs preserve the components of the activation most relevant to the model's downstream computation. These results suggest that removing linearly predictable cross-layer structure is a useful default for multi-layer SAE interventions.
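
The residualization step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (fit_affine, residualize, map_back), the least-squares fit, the synthetic activations, and the two-layer setup are all assumptions made for illustration, and the SAE itself is replaced by a perfect-reconstruction stand-in so that only the affine bookkeeping is shown.

```python
import numpy as np

def fit_affine(x_prev, x_next):
    """Least-squares affine map so that x_next is approximated by x_prev @ W + b."""
    ones = np.ones((x_prev.shape[0], 1))
    design = np.hstack([x_prev, ones])            # append a bias column
    coef, *_ = np.linalg.lstsq(design, x_next, rcond=None)
    return coef[:-1], coef[-1]                    # W: (d, d), b: (d,)

def residualize(x_prev, x_next, W, b):
    """Part of the later-layer activation not linearly predicted by the earlier layer."""
    return x_next - (x_prev @ W + b)

def map_back(x_prev_hat, resid_hat, W, b):
    """Return to the original activation space: affine prediction plus residual."""
    return x_prev_hat @ W + b + resid_hat

# Synthetic stand-ins for activations at two selected layers (hypothetical data).
rng = np.random.default_rng(0)
n, d = 512, 8
x_prev = rng.normal(size=(n, d))
A_true = rng.normal(size=(d, d))
x_next = x_prev @ A_true + 0.5 + 0.1 * rng.normal(size=(n, d))

W, b = fit_affine(x_prev, x_next)
resid = residualize(x_prev, x_next, W, b)         # the later-layer SAE would train on this

# Stand-in for SAE outputs: here both layers are "reconstructed" perfectly.
x_prev_hat, resid_hat = x_prev, resid
x_next_hat = map_back(x_prev_hat, resid_hat, W, b)

print("variance of raw later-layer activation:", x_next.var())
print("variance of residual target:", resid.var())
```

In this toy setting the residual carries far less variance than the raw later-layer activation, because most of that activation is a linear function of the earlier layer. With more than two selected layers, one would presumably apply map_back repeatedly, feeding each layer's mapped-back reconstruction into the next affine map; that composition is an assumption about what "the fitted affine chain" means here.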
