Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs
This work addresses the challenge of evaluating model trustworthiness for users who rely on LLM outputs, though its contribution is incremental because it builds on existing explainability techniques.
The paper tackles the problem of assessing faithfulness in black-box large language models. It introduces a task that uses local perturbations and self-explanations to identify the parts of the context that are crucial for correct answers, and it validates the approach on the Natural Questions dataset, where the authors report that it is effective.
This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. LLMs often require additional context to answer certain questions correctly. To determine which parts of that context matter, we propose a new and efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the parts of the context that are sufficient and necessary for the LLM to generate correct answers; these parts serve as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
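To make the idea concrete, the following is a minimal sketch of the plain leave-one-out baseline that the abstract cites as inspiration, not the paper's own (more efficient) technique. It assumes the context is split into sentence-level parts, that correctness is judged by a simple substring match against a gold answer, and that the self-explanation arrives as a set of part indices. The names `leave_one_out_necessary`, `overlap_score`, and `toy_model` are hypothetical, and the Jaccard overlap is an assumed stand-in for the faithfulness metric, whose exact form the abstract does not give.

```python
from typing import Callable, List, Set


def leave_one_out_necessary(
    question: str,
    context_parts: List[str],
    answer_fn: Callable[[str, List[str]], str],
    gold_answer: str,
) -> Set[int]:
    """Return indices of context parts whose removal breaks a correct answer."""

    def is_correct(parts: List[str]) -> bool:
        # Assumption: a case-insensitive substring match counts as correct.
        return gold_answer.lower() in answer_fn(question, parts).lower()

    if not is_correct(context_parts):
        # The model fails even with the full context; nothing to attribute.
        return set()

    necessary: Set[int] = set()
    for i in range(len(context_parts)):
        # Local perturbation: drop exactly one part and query the model again.
        perturbed = context_parts[:i] + context_parts[i + 1:]
        if not is_correct(perturbed):
            necessary.add(i)
    return necessary


def overlap_score(crucial: Set[int], self_explained: Set[int]) -> float:
    """Jaccard overlap between crucial parts and the parts the model cites.

    Assumption: this is an illustrative stand-in, not the paper's metric.
    """
    if not crucial and not self_explained:
        return 1.0
    return len(crucial & self_explained) / len(crucial | self_explained)


def toy_model(question: str, parts: List[str]) -> str:
    """Hypothetical stand-in for a black-box LLM call."""
    for part in parts:
        if "Paris" in part:
            return "Paris"
    return "unknown"


context = [
    "France is a country in Western Europe.",
    "Its capital is Paris.",
    "The Seine flows through the city.",
]
crucial = leave_one_out_necessary(
    "What is the capital of France?", context, toy_model, "Paris"
)
# Suppose the model's self-explanation pointed at parts 1 and 2.
score = overlap_score(crucial, {1, 2})
print(crucial, score)
```

In this toy setting only the second sentence is necessary, so a self-explanation that also cites the third sentence is penalized. With a real black-box LLM, `answer_fn` would wrap an API call, and each perturbation costs one query, which is the expense an efficient alternative would aim to reduce.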