CL, LG | Feb 27, 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger

DeepMind, Stanford

arXiv:2402.17700v2 | 24.670 citations | h-index: 32 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work addresses the challenge of quantitatively comparing interpretability methods, a problem relevant to researchers in AI and NLP. It is an incremental advance because it builds on existing methods.

The authors evaluate how well interpretability methods disentangle language model representations by introducing the RAVEL dataset. Their new method, MDAS, achieves state-of-the-art results on this benchmark with Llama2-7B as the target model.

Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/explanare/ravel.
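The contrast the abstract draws between neuron-level analysis and features distributed across activations can be illustrated with a toy example. The sketch below is not the authors' MDAS implementation and does not use the RAVEL data. It assumes a hypothetical 2-dimensional activation in which two attributes are encoded along orthogonal directions that are rotated 45 degrees away from the neuron axes. The rotation is fixed by hand, and all function names are made up for illustration. Under these assumptions, copying a single neuron from a source activation into a base activation changes both attributes, while copying one coordinate in the rotated basis changes only the targeted attribute.

```python
import numpy as np

# Hypothetical toy encoder: the columns of R are the two feature directions.
# They are orthogonal but not aligned with the neuron (coordinate) axes.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])


def encode(attr_a, attr_b):
    """Map two attribute values to a 2-neuron activation vector."""
    return R @ np.array([attr_a, attr_b], dtype=float)


def decode(hidden):
    """Read both attribute values back out of an activation vector."""
    return R.T @ hidden


def swap_neuron(base, source, idx):
    """Neuron-level intervention: copy one neuron from source into base."""
    out = base.copy()
    out[idx] = source[idx]
    return out


def swap_subspace(base, source, rotation, dims):
    """Distributed intervention: copy coordinates in a rotated basis."""
    base_rot = rotation.T @ base
    source_rot = rotation.T @ source
    base_rot[dims] = source_rot[dims]
    return rotation @ base_rot


base = encode(1.0, 2.0)      # attribute A = 1, attribute B = 2
source = encode(5.0, -3.0)   # attribute A = 5, attribute B = -3

# Goal: set attribute A to the source value while leaving attribute B alone.
neuron_result = decode(swap_neuron(base, source, 0))
subspace_result = decode(swap_subspace(base, source, R, [0]))

print("neuron swap   ->", neuron_result)    # both attributes move
print("subspace swap ->", subspace_result)  # only attribute A moves
```

In this toy setup the subspace intervention isolates attribute A exactly because the rotation matches the encoder by construction. In a real model the relevant basis is unknown and has to be found by some search or training procedure.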
