Rigorously Assessing Natural Language Explanations of Neurons
This paper addresses the challenge of interpretability in AI and is relevant to both researchers and practitioners. The contribution is incremental: it builds on prior work to provide a more rigorous evaluation framework.
The paper tackles the problem of evaluating the faithfulness of natural language explanations of neurons in large language models, and develops two modes of assessment: observational and intervention-based. It applies this framework to the GPT-4-generated explanations of GPT-2 XL neurons from Bills et al. (2023), finding high error rates and low causal efficacy, with error rates up to 90% in some cases.
Natural language is an appealing medium for explaining how large language models process and store information, but evaluating the faithfulness of such explanations is challenging. To help address this, we develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. In the observational mode, we evaluate claims that a neuron $a$ activates on all and only input strings that refer to a concept picked out by the proposed explanation $E$. In the intervention mode, we construe $E$ as a claim that the neuron $a$ is a causal mediator of the concept denoted by $E$. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons of Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy. We close the paper by critically assessing whether natural language is a good choice for explanations and whether neurons are the best level of analysis.
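As a rough illustration of the observational mode, the sketch below treats an explanation $E$ as a binary labeling of input strings and checks the "all and only" claim against a neuron's activations. It is a minimal sketch under stated assumptions, not the paper's actual procedure: the function name `observational_errors`, the activation threshold, and the toy labels are all hypothetical. A false positive (the neuron fires on a string that does not refer to the concept) violates the "only" half of the claim; a false negative (the neuron stays silent on a string that does) violates the "all" half.

```python
from typing import Dict, Sequence


def observational_errors(
    activations: Sequence[float],
    refers_to_concept: Sequence[bool],
    threshold: float = 0.0,  # assumption: "activates" means activation > threshold
) -> Dict[str, float]:
    """Compare where a neuron fires against where an explanation says it should."""
    if len(activations) != len(refers_to_concept):
        raise ValueError("activations and labels must have the same length")

    fired = [a > threshold for a in activations]

    # Fires on an input outside the concept: violates "only".
    false_pos = sum(f and not c for f, c in zip(fired, refers_to_concept))
    # Silent on an input inside the concept: violates "all".
    false_neg = sum(c and not f for f, c in zip(fired, refers_to_concept))

    n_neg = sum(not c for c in refers_to_concept)
    n_pos = sum(bool(c) for c in refers_to_concept)

    return {
        "false_positive_rate": false_pos / n_neg if n_neg else 0.0,
        "false_negative_rate": false_neg / n_pos if n_pos else 0.0,
    }


# Toy data (hypothetical): four inputs, two of which refer to the concept.
rates = observational_errors(
    activations=[0.9, 0.0, 0.4, 0.0],
    refers_to_concept=[True, True, False, False],
)
print(rates)
```

Reporting the two error rates separately keeps the "all" and "only" halves of the claim distinguishable, so an explanation that is too broad is not confused with one that is too narrow. The intervention mode would need access to the model's internals and is not sketched here.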