LG AI May 27

Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

arXiv:2605.28567

AI Analysis

This work addresses the scalability of SAE-based interpretability for language models by unifying two key analyses, enabling more efficient and accurate cross-layer feature matching and circuit compression.

The paper introduces a distributional framework for matching semantically similar features across layers and compressing feature circuits in sparse autoencoders (SAEs) for language model interpretability. The method outperforms decoder-vector and LLM-based baselines and automatically compresses circuits into interpretable supernodes.

Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses remain difficult to scale: (1) matching semantically similar features across multiple layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge, which we frame as the estimation of semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector, as in prior work, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with the Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, our method compresses large feature circuits into interpretable supernodes automatically.
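To make the distributional representation concrete, here is a minimal sketch and not the paper's implementation. It assumes the hidden states have already been projected into the shared reference space, which the abstract does not specify. It also swaps in a sliced (random-projection) approximation of the Wasserstein-1 distance to stay dependency-light. The function names `feature_distribution`, `wasserstein_1d`, and `sliced_wasserstein` are hypothetical.

```python
import numpy as np


def feature_distribution(hidden, acts):
    """Represent one SAE feature as a weighted point cloud.

    hidden: (n_tokens, d) hidden states, assumed to already live in the
            shared reference space (the projection itself is not shown).
    acts:   (n_tokens,) non-negative activations of the feature.
    Returns the support points where the feature fires and weights
    proportional to activation, normalized to sum to 1.
    """
    hidden = np.asarray(hidden, dtype=float)
    acts = np.asarray(acts, dtype=float)
    mask = acts > 0
    weights = acts[mask] / acts[mask].sum()
    return hidden[mask], weights


def wasserstein_1d(x, wx, y, wy):
    """Exact Wasserstein-1 distance between two weighted 1-D samples.

    Uses W1 = integral of |F_x(t) - F_y(t)| dt over the merged support.
    """
    pts = np.concatenate([x, y])
    order = np.argsort(pts)
    pts_sorted = pts[order]
    # Signed mass: +wx at x points, -wy at y points, so the running sum
    # equals the difference of the two CDFs.
    signed_mass = np.concatenate([wx, -wy])[order]
    cdf_diff = np.cumsum(signed_mass)[:-1]
    return float(np.sum(np.abs(cdf_diff) * np.diff(pts_sorted)))


def sliced_wasserstein(x, wx, y, wy, n_proj=64, seed=0):
    """Approximate multi-dimensional W1 by averaging 1-D distances
    along random unit directions (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return float(np.mean([wasserstein_1d(x @ d, wx, y @ d, wy) for d in dirs]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(200, 8))           # synthetic hidden states
    acts = rng.uniform(0.0, 1.0, size=200)       # synthetic activations
    a_pts, a_w = feature_distribution(hidden, acts)
    # Rescaling the activations leaves the normalized weights unchanged.
    b_pts, b_w = feature_distribution(hidden, 10.0 * acts)
    print(sliced_wasserstein(a_pts, a_w, b_pts, b_w))
```

Because the weights are normalized, multiplying a feature's activations by a positive constant does not change its distribution, which mirrors the rescaling invariance claimed in the abstract. Under these assumptions, cross-layer matching could then be approximated by taking, for each feature in one layer, the feature in the other layer with the smallest distance.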
