AI · May 7

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

Ruben Fernandez-Boullon, Pablo Magariños-Docampo, Javier Perez-Robles

arXiv:2605.06494 · 50.7

Predicted impact: top 72% in AI · last 90 days
Originality: Synthesis-oriented

AI Analysis

For mechanistic interpretability researchers, this provides a complementary structural view of SAE features, but the contribution is incremental: a simpler token-histogram baseline still achieves higher cluster purity.

The paper introduces a graph-structured representation for SAE features based on token co-occurrence, using a WL-style kernel for clustering. Applied to GPT-2 Small features, it recovers heuristic motif families not found by decoder cosine similarity, though a token-histogram baseline achieves higher purity.

Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens most frequent near strong activations, and edges connect pairs that co-occur within local context windows. A custom WL-style, frequency-binned graph kernel then provides a similarity measure over this structural space. Applied as a proof of concept to features from a large SAE trained on GPT-2 Small and probed with a synthetic mixed-domain corpus, our clustering recovers heuristic motif families (punctuation-heavy patterns, language and script clusters, and code-like templates) that are not recovered by clustering on decoder cosine similarity. A token-histogram baseline achieves higher overall purity, so the contribution of the graph view is complementary rather than dominant: it surfaces structural relationships that token-frequency and decoder-weight views alone do not capture. Cluster assignments are stable across graph-construction hyperparameters and random seeds.
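
The abstract does not give the exact construction, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each feature comes with a list of context windows around its strong activations, keeps the `top_k` most frequent tokens as nodes, links tokens that share a window, and uses log2 frequency bins as the initial WL labels. The function names, the `top_k` default, the binning rule, and the cosine normalisation are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
import math


def build_cooccurrence_graph(contexts, top_k=5):
    """Build a token co-occurrence graph for one feature.

    contexts: list of token lists, each a local window near a strong activation.
    Returns (nodes, edges): token -> frequency, and a set of (a, b) pairs with a < b.
    """
    counts = Counter(tok for window in contexts for tok in window)
    nodes = dict(counts.most_common(top_k))  # keep only the most frequent tokens
    edges = set()
    for window in contexts:
        present = sorted(set(window) & set(nodes))
        edges.update(combinations(present, 2))  # tokens sharing a window are linked
    return nodes, edges


def wl_features(nodes, edges, iterations=2):
    """Count WL labels, starting from frequency bins instead of token identities."""
    # Assumed binning rule: floor(log2(frequency)).
    labels = {tok: f"bin{int(math.log2(c))}" for tok, c in nodes.items()}
    neighbours = {tok: [] for tok in nodes}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    features = Counter(labels.values())
    for _ in range(iterations):
        # Each node's new label is its old label plus the sorted labels of its neighbours.
        labels = {
            tok: labels[tok] + "(" + ",".join(sorted(labels[n] for n in neighbours[tok])) + ")"
            for tok in nodes
        }
        features.update(labels.values())
    return features


def wl_similarity(f1, f2):
    """Cosine-normalised dot product of two WL label histograms."""
    dot = sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())
    norm = math.sqrt(sum(v * v for v in f1.values())) * math.sqrt(sum(v * v for v in f2.values()))
    return dot / norm if norm else 0.0


# Toy windows: two bracket-like templates and one unrelated pattern.
parens = wl_features(*build_cooccurrence_graph([["(", "x", ")"], ["(", "y", ")"]]))
squares = wl_features(*build_cooccurrence_graph([["[", "x", "]"], ["[", "y", "]"]]))
pairs = wl_features(*build_cooccurrence_graph([["a", "b"], ["c", "d"]]))

sim_same_shape = wl_similarity(parens, squares)
sim_diff_shape = wl_similarity(parens, pairs)
```

Because the initial labels depend on frequency bins rather than token identity, the two bracket templates get the same label histogram even though they share only some tokens, while the unrelated pattern scores lower. This is the kind of structural relationship a pure token-histogram view would not capture on its own.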

