LG AI CL Nov 13, 2024

Can sparse autoencoders be used to decompose and interpret steering vectors?

arXiv:2411.08790v119.817 citations h-index: 5 Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses a key interpretability challenge for AI researchers, but it is incremental as it identifies limitations rather than proposing a new solution.

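The paper examines whether sparse autoencoders can be used to interpret steering vectors in large language models. It finds that direct application fails for two reasons: steering vectors lie outside the input distribution the autoencoders are designed for, and they can have meaningful negative projections onto feature directions. Both prevent an accurate decomposition.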

Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
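The second limitation can be illustrated with a minimal toy sketch. It is not the paper's code or setup: the three-dimensional orthonormal dictionary, the tied-weight zero-bias encoder, and the construction of the steering vector as a difference between two activations are all assumptions made for illustration. It only shows that a ReLU encoder, which outputs non-negative activations, drops a negative projection onto a feature direction.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical toy dictionary: three orthonormal feature directions (rows).
features = np.eye(3)

def sae_encode(x):
    # Toy tied-weight encoder with zero bias (an assumption, not a trained SAE).
    # The ReLU forces every feature activation to be >= 0.
    return relu(features @ x)

def sae_decode(acts):
    # Reconstruct the input as a non-negative combination of feature directions.
    return features.T @ acts

# Toy "model activations" built from non-negative feature coefficients.
positive_prompt_act = np.array([1.5, 0.0, 0.5])
negative_prompt_act = np.array([0.5, 2.0, 0.5])

# Assumed construction: the steering vector is a difference of activations,
# so it can point against a feature direction even if no activation does.
steering_vector = positive_prompt_act - negative_prompt_act

signed_projections = features @ steering_vector  # keeps the sign information
sae_activations = sae_encode(steering_vector)    # negative part clipped to 0
reconstruction = sae_decode(sae_activations)

print("signed projections:", signed_projections)
print("SAE activations:   ", sae_activations)
print("reconstruction:    ", reconstruction)
```

In this toy case the signed projection onto the second feature is -2, but the encoder reports 0 for it, so the reconstruction keeps only the first feature and loses the component that suppresses the second one.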
