LG · May 29

Toward Identifiable Sparse Autoencoders

Walter Nelson, Theofanis Karaletsos, Francesco Locatello

arXiv:2605.31245

AI Analysis

This work tackles instability in sparse autoencoders, a problem that matters to researchers and practitioners who rely on them to interpret and interact with neural network representations reliably.

This paper addresses the instability of sparse autoencoders (SAEs): different training runs yield different concept dictionaries and sparse codes. The authors characterize the model properties that cause this instability and propose minimal changes to the architecture and training procedure, yielding two versions of an "identifiable SAE" (iSAE), a variant of the standard TopK SAE with lower reconstruction error and improved stability.

Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs are highly unstable: different training runs are likely to produce different concept dictionaries and sparse codes. We characterize the model properties that hinder the stability of real-world SAEs, and address each of these problems through minimal changes to the architecture and training procedure. Together, these changes yield two versions of an identifiable SAE (iSAE), a variant of the standard TopK SAE with lower reconstruction error and improved stability. We explain this improvement theoretically by connecting SAEs with traditional dictionary learning approaches, and show that the dictionaries learned in practice satisfy an approximate restricted isometry condition, rendering the corresponding sparse codes in those models near-identifiable.
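
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a generic TopK SAE forward pass. It is not the authors' iSAE and is not taken from the paper. The function name `topk_sae_forward`, the parameter names, the choice to subtract the decoder bias before encoding, and the ReLU on the kept activations are all assumptions reflecting one common formulation. The "concept dictionary" corresponds to the rows of `W_dec`, and the "sparse code" is the returned `code`.

```python
def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a generic TopK sparse autoencoder (illustrative only).

    x      : input vector of length d
    W_enc  : m encoder rows, each of length d (hypothetical layout)
    b_enc  : encoder bias of length m
    W_dec  : m dictionary atoms, each of length d (one atom per latent)
    b_dec  : decoder bias of length d
    k      : number of latents allowed to be active
    """
    # Assumption: center the input with the decoder bias before encoding.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]

    # One pre-activation score per dictionary atom.
    pre = [
        sum(w * c for w, c in zip(row, centered)) + b
        for row, b in zip(W_enc, b_enc)
    ]

    # Keep the indices of the k largest pre-activations; all others are zeroed.
    keep = sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)[:k]

    code = [0.0] * len(pre)
    for j in keep:
        # Assumption: clamp kept activations at zero (ReLU).
        code[j] = max(pre[j], 0.0)

    # Reconstruction: decoder bias plus a sparse combination of atoms.
    recon = list(b_dec)
    for j in keep:
        for i in range(len(recon)):
            recon[i] += code[j] * W_dec[j][i]

    return code, recon


# Tiny usage example: three atoms in two dimensions, one active latent.
atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
code, recon = topk_sae_forward(
    [1.0, 2.0], atoms, [0.0, 0.0, 0.0], atoms, [0.0, 0.0], k=1
)
print(code)   # [0.0, 0.0, 3.0]
print(recon)  # [3.0, 3.0]
```

In this toy setup the third atom wins the top-1 selection, so the code has a single nonzero entry. The instability described in the abstract concerns how much `W_dec` and `code` change across training runs, which this sketch does not attempt to reproduce.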
