LG · AI · Jun 2

How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models

arXiv:2606.03002 · 13.0

Predicted impact: top 48% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners deploying quantized models, this shows that behavioral parity (e.g., perplexity) is insufficient to guarantee that interpretability tools and safety interventions remain valid.

Quantization degrades interpretable sparse autoencoder features in language models before task metrics show damage; at INT6, 62.4% of features survive in Pythia-70M and 51.3% in Gemma-2-2B, and survival is predictable from full-precision statistics with AUC 0.92-0.97.

Quantization is a standard path to deploying large language models, and a quantized model is typically judged acceptable when its perplexity or downstream accuracy stays close to the full-precision original. Whether the model still computes in the same way, or whether the interpretable features identified in the full-precision model survive weight rounding, is rarely tested, even as safety audits and steering interventions increasingly rely on those features. We ask whether sparse autoencoder (SAE) features extracted from a dense full-precision model remain faithful once that model is quantized. Using a frozen SAE as a fixed measurement basis, we encode full-precision and round-to-nearest (RTN) quantized activations on identical tokens and quantify per-feature survival by Pearson correlation, sweeping bit-widths from INT8 to INT4 on Pythia-70M and Gemma-2-2B. We find that feature survival is graded: features degrade systematically rather than failing all at once, with 62.4 percent of active features surviving at INT6 on Pythia-70M and 51.3 percent surviving at INT6 on Gemma-2-2B, and with most non-survivors blurred rather than destroyed. Survival is predictable from full-precision statistics alone, with cross-validated AUCs of 0.92 to 0.97 and peak activation as the strongest marginal predictor. Critically, task metrics can miss this damage: on Gemma-2-2B, INT7 improves perplexity while degrading 18.7 percent of features. Finally, quantization and matched-perplexity magnitude pruning damage strongly overlapping feature sets, with Jaccard overlap of 0.79 to 0.86 and damage-score Spearman correlation of 0.98, suggesting a shared mode of compression-induced vulnerability. These results show that behavioral parity is insufficient evidence that interpretability findings transfer to quantized deployments, motivating feature-level audits of compression.
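The measurement described in the abstract can be sketched in a few lines: quantize the weights round-to-nearest, pass the same inputs through both models, encode both activation sets with one frozen SAE, and correlate each feature with itself. The sketch below is a toy illustration, not the authors' code. The symmetric per-tensor RTN scheme, the single linear layer standing in for the model, the ReLU encoder, the random weights, and the survival threshold of 0.9 are all assumptions; the abstract does not specify any of them. All function names are hypothetical.

```python
import numpy as np

def rtn_quantize(w, bits):
    """Round-to-nearest quantization, assuming a symmetric per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 31 for INT6
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return w.copy()
    scale = max_abs / qmax
    # Round to the nearest integer level, then map back to floats.
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def sae_encode(acts, w_enc, b_enc):
    """Frozen SAE encoder, assumed here to be a plain ReLU encoder."""
    return np.maximum(acts @ w_enc + b_enc, 0.0)

def per_feature_pearson(f_fp, f_q, eps=1e-12):
    """Pearson correlation of each feature column across tokens."""
    a = f_fp - f_fp.mean(axis=0)
    b = f_q - f_q.mean(axis=0)
    num = (a * b).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    # Features with zero variance on either side get correlation 0.
    return np.where(den > eps, num / np.maximum(den, eps), 0.0)

def jaccard(mask_a, mask_b):
    """Jaccard overlap between two boolean feature sets."""
    a, b = np.asarray(mask_a, bool), np.asarray(mask_b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# Toy setup: one linear layer plays the role of the model (assumption).
rng = np.random.default_rng(0)
n_tokens, d_in, d_model, n_features = 512, 32, 64, 256
x = rng.normal(size=(n_tokens, d_in))            # identical tokens for both runs
w_model = rng.normal(size=(d_in, d_model))
w_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)

f_fp = sae_encode(x @ w_model, w_enc, b_enc)     # full-precision features
active = f_fp.max(axis=0) > 0                    # features that ever fire

SURVIVAL_THRESHOLD = 0.9                         # assumed, not from the paper
for bits in (8, 7, 6, 5, 4):
    f_q = sae_encode(x @ rtn_quantize(w_model, bits), w_enc, b_enc)
    r = per_feature_pearson(f_fp, f_q)
    survived = (r >= SURVIVAL_THRESHOLD) & active
    print(f"INT{bits}: {survived.sum() / active.sum():.1%} of active features survive")
```

The same `jaccard` helper applies to the pruning comparison: given boolean masks of features damaged by quantization and by pruning, it returns the fraction of the union that both methods damage. On random toy weights the printed percentages carry no meaning beyond showing the mechanics; the figures reported in the abstract come from real models and SAEs.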

