CL · AI · Sep 18, 2024

Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

arXiv:2409.13764v1 · 3 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the challenge of evaluating model trustworthiness for users relying on LLM outputs, though it is incremental as it builds on existing explainability techniques.

The paper assesses faithfulness in black-box large language models by introducing a task that uses local perturbations and self-explanations to identify the parts that are sufficient and necessary for a correct answer. The approach is validated on the Natural Questions dataset.

This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. LLMs often require additional context to answer certain questions correctly. For this purpose, we propose a new, efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, which serve as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
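
The abstract does not spell out the perturbation scheme or the metric, so the following is only a minimal sketch of the plain leave-one-out baseline that the paper says it takes inspiration from, not the authors' more efficient technique. It drops one context part at a time and marks a part as necessary if the answer stops being correct without it; it does not test sufficiency. The names `necessary_parts`, `overlap_score`, and `toy_answer_fn` are hypothetical, the toy function stands in for a black-box LLM call, and Jaccard overlap is an assumed stand-in for the paper's faithfulness metric.

```python
from typing import Callable, List, Set


def necessary_parts(
    question: str,
    parts: List[str],
    answer_fn: Callable[[str, List[str]], str],
    is_correct: Callable[[str], bool],
) -> Set[int]:
    """Return indices of context parts whose removal breaks the answer.

    Plain leave-one-out: one model call per part (hypothetical helper).
    """
    necessary: Set[int] = set()
    for i in range(len(parts)):
        reduced = parts[:i] + parts[i + 1:]  # context without part i
        if not is_correct(answer_fn(question, reduced)):
            necessary.add(i)
    return necessary


def overlap_score(crucial: Set[int], self_explained: Set[int]) -> float:
    """Jaccard overlap between perturbation-derived and self-reported parts.

    Assumption: the paper's actual metric may differ.
    """
    union = crucial | self_explained
    if not union:
        return 1.0  # both sets empty: trivially in agreement
    return len(crucial & self_explained) / len(union)


def toy_answer_fn(question: str, parts: List[str]) -> str:
    """Toy stand-in for a black-box LLM: answers only if the key fact is present."""
    has_fact = any("capital of France is Paris" in p for p in parts)
    return "Paris" if has_fact else "unknown"


context = [
    "France is in Europe.",
    "The capital of France is Paris.",
    "It has many museums.",
]
crucial = necessary_parts(
    "What is the capital of France?",
    context,
    toy_answer_fn,
    lambda answer: answer == "Paris",
)
# Suppose the model's self-explanation cites parts 1 and 2.
score = overlap_score(crucial, {1, 2})
```

In this toy setup only the second sentence is necessary, while the assumed self-explanation also cites the third, so the overlap is partial. With a real model, `answer_fn` would wrap an API call, and the cost of one call per part is the kind of overhead a more efficient alternative would aim to reduce.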

