LG · May 4

Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

Michael A. Riegler, Birk Sebastian Frostelid Torpmann-Hagen

arXiv:2605.03160

AI Analysis

For interpretability researchers, the paper argues that the standard SAE protocol is insufficient for causal attribution and offers a more rigorous method for identifying causally relevant features.

The paper introduces a pairwise matrix protocol for sparse autoencoder interpretability and shows that single-feature inspection can mislabel causal axes. On Qwen3-1.7B-Instruct and Gemma-2-2B-it, it reports three phenomena that the standard protocol misses: an inverted U-shaped response to steering, joint suppression that damages multiple capabilities, and coherence loss that depends on direction pattern rather than magnitude, with the matched-geometry control CI-separated by ~10x.

The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-varying steering coefficient with joint condition, and report three findings the standard one-corner protocol misses on Qwen3-1.7B-Instruct, replicated on Gemma-2-2B-it. First, a feature labelled "AI self-disclaimer" from its top contexts produces an inverted U-shape under a coefficient sweep: at c=+500 the model substitutes a fluent contemplative-philosopher voice for the disclaimer. Two further features anchor the criterion (one monotonic, one pure breakdown). Second, three near-orthogonal cluster-specific features that individually steer a philosophy-of-mind register, jointly suppressed at c=-500, damage grounded composition on recipes and engine explanations as well as introspective prompts; single-feature suppression at the same magnitude leaves controls intact. Third, a matched-geometry comparison of single-feature, joint, and random-direction perturbations (norm ~1.55, cosine ~0.64) yields three distinct output regimes: single-feature substitutes strategy filler, random direction substitutes diverse content, joint suppression alone produces placeholder text. Coherence loss is direction-pattern-dependent, not magnitude-dependent. All three findings reproduce on Gemma with model-specific damage signatures; the matched-geometry control is CI-separated by ~10x. The pipeline also locates a top causally responsible feature in Llama-3.1-8B-Instruct.
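To make the idea of co-varying coefficient and condition concrete, here is a minimal sketch of how such a grid of perturbations could be laid out. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`unit`, `joint_direction`, `matched_random`, `pairwise_matrix`) are hypothetical, the joint direction is assumed to be a plain sum of unit-normalised feature directions, and the random control is matched on L2 norm only, since the abstract does not say what the reported cosine is measured against. The toy features are orthogonal basis vectors, not real SAE decoder directions.

```python
import numpy as np

def unit(v):
    """Return v rescaled to unit L2 norm."""
    return v / np.linalg.norm(v)

def joint_direction(features):
    """Assumption: the joint condition is the sum of unit feature directions."""
    return np.sum([unit(f) for f in features], axis=0)

def matched_random(target, rng):
    """Random direction rescaled to the same L2 norm as `target`."""
    r = rng.standard_normal(target.shape)
    return unit(r) * np.linalg.norm(target)

def pairwise_matrix(features, coefficients, rng):
    """Map (condition, coefficient) -> perturbation vector.

    Conditions are each single feature, the joint direction, and a
    norm-matched random direction; every condition is swept over the
    same coefficients.
    """
    joint = joint_direction(features)
    conditions = {f"single_{i}": unit(f) for i, f in enumerate(features)}
    conditions["joint"] = joint
    conditions["random"] = matched_random(joint, rng)
    return {(name, c): c * d
            for name, d in conditions.items()
            for c in coefficients}

# Toy usage: three orthogonal directions in an 8-dimensional space.
rng = np.random.default_rng(0)
features = list(np.eye(8)[:3])
grid = pairwise_matrix(features, [-500, 0, 500], rng)
```

Each entry of `grid` would then be added to the model's activations at the steered layer and the resulting generations compared across the whole grid, rather than at a single (feature, coefficient) corner. How the perturbation is injected and how outputs are scored are left out here because the abstract does not specify them.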
