LG · Oct 16, 2024

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

arXiv:2410.12555v2 · 4 citations · h-index: 3
Originality: Synthesis-oriented
AI Analysis

This paper offers incremental improvements for researchers studying interpretability in language models.

The paper tackles the problem of understanding computational features in language models by investigating sensitive directions in GPT-2, and it introduces an improved baseline for perturbation directions. The results show that the KL divergence for Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high when compared to the improved baseline, and that SAE feature directions have varying impacts depending on sparsity, with feature directions from lower-L0 SAEs exerting greater influence.

Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions. We extend the sensitive directions work by introducing an improved baseline for perturbation directions. We demonstrate that the KL divergence for Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high when compared to the improved baseline. We also show that feature directions uncovered by SAEs have varying impacts on model outputs depending on the SAE's sparsity, with feature directions from lower-L0 SAEs exerting a greater influence. Additionally, we find that end-to-end SAE features do not exhibit stronger effects on model outputs than traditional SAE features.
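The measurement described in the abstract can be illustrated with a minimal sketch: perturb an activation vector along a unit direction by a fixed length, then compute the KL divergence between the original and perturbed next-token distributions. Everything here is an assumption made for illustration: a random linear readout (`unembed`) stands in for the rest of GPT-2, and the names `perturbation_kl`, `d_model`, and `vocab` are hypothetical. It does not reproduce the paper's improved baseline or its SAE directions.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions over the same vocabulary.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def perturbation_kl(activation, direction, scale, unembed):
    # Move `activation` a distance `scale` along the normalized `direction`
    # and compare next-token distributions before and after.
    unit = direction / np.linalg.norm(direction)
    p = softmax(activation @ unembed)                   # original prediction
    q = softmax((activation + scale * unit) @ unembed)  # perturbed prediction
    return kl_divergence(p, q)


# Toy setup (assumption): a random linear map replaces the downstream model.
rng = np.random.default_rng(0)
d_model, vocab = 16, 50
unembed = rng.normal(size=(d_model, vocab))
activation = rng.normal(size=d_model)
random_dir = rng.normal(size=d_model)

for scale in (0.0, 0.5, 1.0, 2.0):
    kl = perturbation_kl(activation, random_dir, scale, unembed)
    print(f"perturbation length {scale:.1f}: KL = {kl:.4f}")
```

In a real experiment the linear readout would be replaced by a forward pass through the remaining layers of the model, and the same function would be evaluated for each family of directions (for example SAE feature directions, SAE reconstruction errors, and baseline directions) at matched perturbation lengths so that their KL curves can be compared.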

