Noiser: Bounded Input Perturbations for Attributing Large Language Models
This work addresses the need for more reliable explanations of LLM predictions, which is important for anyone working in AI interpretability. The contribution is incremental, however, because it builds on existing perturbation-based approaches.
The paper tackles the problem of generating faithful feature attributions for Large Language Models. It introduces Noiser, a perturbation-based method that applies bounded noise to input embeddings and measures the model's robustness to that noise, and it demonstrates that Noiser consistently outperforms existing methods in faithfulness and answerability across six LLMs and three tasks.
Feature attribution (FA) methods are common post-hoc approaches that explain how Large Language Models (LLMs) make predictions. Accordingly, generating faithful attributions that reflect the actual inner behavior of the model is crucial. In this paper, we introduce Noiser, a perturbation-based FA method that imposes bounded noise on each input embedding and measures the robustness of the model against partially noised input to obtain the input attributions. Additionally, we propose an answerability metric that employs an instructed judge model to assess the extent to which highly scored tokens suffice to recover the predicted output. Through a comprehensive evaluation across six LLMs and three tasks, we demonstrate that Noiser consistently outperforms existing gradient-based, attention-based, and perturbation-based FA methods in terms of both faithfulness and answerability, making it a robust and effective approach for explaining language model predictions.
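To make the idea of bounded-noise attribution concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the noise distribution, the bound, or the robustness measure, so everything here is an assumption chosen for illustration: uniform noise in [-eps, eps], a score defined as the mean absolute change in the predicted class probability, and a hypothetical toy_model (a gate-weighted sum of token embeddings followed by a two-class softmax) standing in for an LLM. The names toy_model, gates, weights, and bounded_noise_attribution are all hypothetical.

```python
import numpy as np

def toy_model(embeddings, gates, weights):
    # Hypothetical stand-in for an LLM: gate-weighted sum of token
    # embeddings followed by a 2-class softmax.
    pooled = gates @ embeddings              # shape (dim,)
    logits = weights @ pooled                # shape (2,)
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

def bounded_noise_attribution(embeddings, gates, weights,
                              eps=0.5, n_samples=64, seed=0):
    # Assumption: noise is uniform in [-eps, eps] per embedding dimension.
    rng = np.random.default_rng(seed)
    n_tokens, dim = embeddings.shape
    base = toy_model(embeddings, gates, weights)
    target = int(base.argmax())              # predicted class on clean input
    # Reuse the same noise draws for every token so that tokens are
    # compared under identical perturbations (a design choice of this sketch).
    noise = rng.uniform(-eps, eps, size=(n_samples, dim))
    scores = np.zeros(n_tokens)
    for i in range(n_tokens):
        changes = []
        for delta in noise:
            noised = embeddings.copy()
            noised[i] += delta               # perturb only token i
            p = toy_model(noised, gates, weights)[target]
            changes.append(abs(base[target] - p))
        # Assumption: attribution = mean absolute change in the
        # predicted-class probability when token i is noised.
        scores[i] = float(np.mean(changes))
    return scores, target

# Three tokens with 2-dimensional embeddings; the gates control how much
# each token influences the toy model.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gates = np.array([0.0, 1.0, 0.2])
weights = np.array([[1.0, -1.0], [-1.0, 1.0]])
scores, target = bounded_noise_attribution(embeddings, gates, weights)
```

In this toy setup the token whose gate is zero cannot influence the output, so its score is zero, and the token with the largest gate receives the highest score. A real application would replace toy_model with a forward pass through the LLM's embedding layer, and the choice of bound and robustness measure would follow the paper rather than the assumptions made here.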