NormXLogit: The Head-on-Top Never Lies
NormXLogit provides a more efficient and interpretable approach for researchers and practitioners working with diverse LLM architectures, though it is an incremental improvement over existing interpretability techniques.
The paper tackles the problem of interpretability in large language models by proposing NormXLogit, a model-agnostic method for assessing token significance, which outperforms gradient-based methods in faithfulness and offers competitive performance in layer-wise explanations.
With new large language models (LLMs) emerging frequently, model-agnostic approaches that provide interpretability across a variety of architectures are increasingly valuable. While recent advances in LLM interpretability show promise, many rely on complex, model-specific methods with high computational costs. To address these limitations, we propose NormXLogit, a novel technique for assessing the significance of individual input tokens. The method relies only on the input and output representations associated with each token. First, we demonstrate that during the pre-training of LLMs, the norms of word embeddings effectively capture token importance. Second, we reveal a significant relationship between a token's importance and the extent to which its representation can resemble the model's final prediction. Extensive analyses reveal that our approach outperforms existing gradient-based methods in terms of faithfulness and offers competitive performance in layer-wise explanations compared to leading architecture-specific techniques.
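The abstract names the two ingredients (the norm of a token's input embedding, and how strongly the token's output representation supports the model's final prediction) but does not give a formula. The sketch below shows one plausible way to combine them, assuming the score is the product of the embedding norm and the logit obtained by applying the prediction head to that token's final representation. The function name, argument names, and toy values are hypothetical and are not taken from the paper.

```python
import math

def normxlogit_scores(input_embeddings, final_hidden, head_weight, head_bias, predicted_class):
    """Illustrative per-token score (an assumption, not the paper's stated formula):
    score_i = ||input_embedding_i|| * logit_i[predicted_class],
    where logit_i comes from applying the prediction head to token i's final representation.
    """
    w = head_weight[predicted_class]  # head row for the predicted class
    b = head_bias[predicted_class]
    scores = []
    for emb, hid in zip(input_embeddings, final_hidden):
        norm = math.sqrt(sum(v * v for v in emb))            # L2 norm of the input embedding
        logit = sum(wi * hi for wi, hi in zip(w, hid)) + b   # head applied to this token alone
        scores.append(norm * logit)
    return scores

# Toy example: two tokens, hidden size 2, two output classes (all values made up).
input_embeddings = [[3.0, 4.0], [0.0, 1.0]]
final_hidden = [[1.0, 0.0], [0.0, 2.0]]
head_weight = [[1.0, 0.0], [0.0, 1.0]]  # shape: classes x hidden
head_bias = [0.0, 0.0]

scores = normxlogit_scores(input_embeddings, final_hidden, head_weight, head_bias, predicted_class=0)
```

Under these assumptions the computation needs only a forward pass and the model's existing head, with no gradients or architecture-specific decomposition, which is consistent with the model-agnostic and low-cost framing in the abstract.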