Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
The paper evaluates the robustness of concept representations in sparse autoencoders (SAEs) used to interpret large language models (LLMs). It finds that tiny adversarial input perturbations can manipulate these interpretations in most scenarios without notably affecting the outputs of the base LLM.
This finding exposes a critical vulnerability in SAE-based interpretability methods for LLMs, one that could undermine their use in model monitoring and oversight.
Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.
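To make the idea of robustness as an input-space optimization problem concrete, here is a minimal sketch of an untargeted attack on a toy model. It is an illustration under stated assumptions, not the paper's method. The "LLM activation" is a hypothetical stand-in (the mean of random token embeddings), the SAE encoder is a random ReLU layer, and the search is a plain greedy single-token substitution. All function names are invented for this example. The sketch also omits the constraint that the base LLM's outputs stay essentially unchanged, which a realistic evaluation would need to enforce.

```python
import random


def make_matrix(rows, cols, rng):
    """Random Gaussian matrix as a list of rows (toy weights, not trained)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def hidden_state(tokens, embeddings):
    """Hypothetical stand-in for an LLM activation: mean of token embeddings."""
    dim = len(embeddings[0])
    return [sum(embeddings[t][j] for t in tokens) / len(tokens) for j in range(dim)]


def sae_encode(h, enc_weights, enc_bias):
    """Toy SAE encoder: ReLU(W h + b), one activation per concept."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def top_k_concepts(acts, k):
    """Indices of the k most strongly activated concepts."""
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    return set(order[:k])


def concept_overlap(a, b):
    """Jaccard overlap between two non-empty concept sets (1.0 = identical)."""
    return len(a & b) / len(a | b)


def greedy_untargeted_attack(tokens, embeddings, enc_weights, enc_bias, k):
    """Try every single-token substitution and keep the one that minimizes
    the overlap between the original and perturbed top-k concept sets."""
    def concepts_of(seq):
        h = hidden_state(seq, embeddings)
        return top_k_concepts(sae_encode(h, enc_weights, enc_bias), k)

    base = concepts_of(tokens)
    best_tokens, best_overlap = list(tokens), 1.0
    for pos in range(len(tokens)):
        for cand in range(len(embeddings)):
            if cand == tokens[pos]:
                continue
            trial = list(tokens)
            trial[pos] = cand
            overlap = concept_overlap(base, concepts_of(trial))
            if overlap < best_overlap:
                best_tokens, best_overlap = trial, overlap
    return best_tokens, best_overlap


if __name__ == "__main__":
    rng = random.Random(0)
    embeddings = make_matrix(20, 8, rng)   # 20 toy tokens, 8-dim activations
    enc_weights = make_matrix(32, 8, rng)  # 32 toy SAE concepts
    enc_bias = [0.0] * 32
    original = [1, 2, 3, 4, 5]
    adversarial, overlap = greedy_untargeted_attack(
        original, embeddings, enc_weights, enc_bias, k=5
    )
    print("original tokens:   ", original)
    print("adversarial tokens:", adversarial)
    print("top-k concept overlap after attack:", overlap)
```

A lower overlap after a one-token change means the concept-level interpretation is more fragile. A gradient-guided search, or a targeted objective that activates a chosen concept, would fit the same skeleton with a different scoring function.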