LG, AI, CL | Sep 25, 2025

Binary Autoencoder for Mechanistic Interpretability of Large Language Models

arXiv:2509.20997v1 | 1 citation | h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses mechanistic interpretability for LLMs, offering a method for understanding model internals. The contribution appears incremental: it builds on existing autoencoder approaches and adds a specific sparsity mechanism.

The paper tackles the extraction of sparse, interpretable features from large language models (LLMs). It proposes a Binary Autoencoder (BAE) that enforces minimal entropy on minibatches to promote feature independence and sparsity across instances. Among the compared baselines, BAE produces the largest number of interpretable features while avoiding dense features.

Existing work aims to untangle atomized numerical components (features) from the hidden states of Large Language Models (LLMs) in order to interpret their mechanisms. However, these methods typically rely on autoencoders constrained by implicit training-time regularization applied to single training instances (e.g., an $L_1$ penalty or a top-k function), with no explicit guarantee of global sparsity across instances. This yields a large number of dense features (features active on many instances at once), which harms feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1 bit via a step function and apply gradient estimation to enable backpropagation. We term the resulting model the Binary Autoencoder (BAE) and empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we evaluate empirically and use to characterize the inference dynamics of LLMs and in-context learning. (2) Feature untangling. Like typical methods, BAE can extract atomized features from an LLM's hidden states. To evaluate this feature extraction capability robustly, we refine traditional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while producing the largest number of interpretable ones among baselines, confirming its effectiveness as a feature extractor.
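
To make the 1-bit discretization and the minibatch entropy concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not specify the entropy estimator or the gradient estimator, so the sketch assumes a sum of per-feature Bernoulli entropies and a straight-through estimator. All function names and the toy pre-activations are hypothetical.

```python
import math

def step(x, threshold=0.0):
    """1-bit discretization: 1 if the pre-activation exceeds the threshold, else 0."""
    return 1 if x > threshold else 0

def binarize_batch(pre_activations):
    """Apply the step function to every hidden unit of every instance."""
    return [[step(v) for v in row] for row in pre_activations]

def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def minibatch_entropy(codes):
    """ASSUMPTION: estimate entropy as the sum of per-feature marginal entropies.

    Returns the total entropy in bits and each feature's firing rate
    (the fraction of instances in the minibatch on which it is active).
    """
    n = len(codes)
    num_features = len(codes[0])
    rates = [sum(row[j] for row in codes) / n for j in range(num_features)]
    return sum(bernoulli_entropy(p) for p in rates), rates

def straight_through_grad(upstream_grad):
    """ASSUMPTION: straight-through estimator, one common gradient estimator.

    The step function has zero gradient almost everywhere, so the backward
    pass treats it as the identity and passes the gradient through unchanged.
    """
    return upstream_grad

# Hypothetical pre-activations: 4 instances, 3 hidden features.
pre = [
    [0.9, -0.2, 0.1],
    [0.4, -0.7, -0.3],
    [1.2, -0.1, 0.5],
    [0.3, -0.9, -0.8],
]
codes = binarize_batch(pre)
total_bits, rates = minibatch_entropy(codes)
print(rates)       # [1.0, 0.0, 0.5]
print(total_bits)  # 1.0
```

In this toy batch the firing rates show one feature active on every instance, one never active, and one active on half of them. Only the last contributes to the marginal entropy estimate, which is why binary codes make the computation cheap: each feature needs only a count.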

