CL · Jun 10, 2025

Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

arXiv:2506.09277v3 · 4.9 · h-index: 21

Originality: Incremental advance

AI Analysis

This work addresses the need for trustworthy AI systems by providing a method to evaluate and improve the faithfulness of LLM self-explanations. The advance is incremental: it builds on existing evaluation methods by adding an analysis of neural representations.

The paper tackles unfaithful self-explanations in Large Language Models (LLMs), where a generated explanation may not reflect the model's actual reasoning. It proposes NeuroFaith, a framework that measures faithfulness by identifying the key concepts in an explanation and testing whether those concepts influence the model's predictions. The framework is demonstrated on 2-hop reasoning and classification tasks.

Large Language Models (LLMs) can generate plausible free-text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, indicating a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free-text self-explanations by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, a linear faithfulness probe based on NeuroFaith is developed to detect unfaithful self-explanations from representation space and improve faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free-text self-explanations, addressing critical needs for trustworthy AI systems.
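The general idea of a linear probe on representation space, followed by steering, can be illustrated with a minimal sketch. This is not the paper's implementation: it swaps in a simple difference-of-means probe, runs on synthetic vectors rather than real LLM hidden states, and every name in it (`fit_probe`, `is_unfaithful`, `steer`, `DIM`, `SHIFT`) is hypothetical.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_probe(faithful, unfaithful):
    """Difference-of-means linear probe (an assumption, not the paper's method).

    Returns a unit direction pointing from the faithful mean to the
    unfaithful mean, and a threshold halfway between the projected means.
    """
    mu_f = mean_vector(faithful)
    mu_u = mean_vector(unfaithful)
    diff = [u - f for u, f in zip(mu_u, mu_f)]
    norm = dot(diff, diff) ** 0.5
    direction = [d / norm for d in diff]
    threshold = (dot(mu_f, direction) + dot(mu_u, direction)) / 2
    return direction, threshold


def is_unfaithful(h, direction, threshold):
    """Flag a representation whose projection exceeds the threshold."""
    return dot(h, direction) > threshold


def steer(h, direction, alpha):
    """Shift h by alpha against the 'unfaithful' direction."""
    return [x - alpha * d for x, d in zip(h, direction)]


# Synthetic stand-ins for hidden states: the two classes differ only by a
# shift along the first coordinate. Real representations would come from a model.
rng = random.Random(0)
DIM = 8
SHIFT = 3.0


def sample(shift, n=50):
    return [[rng.gauss(shift if i == 0 else 0.0, 0.5) for i in range(DIM)]
            for _ in range(n)]


faithful = sample(0.0)
unfaithful = sample(SHIFT)
direction, threshold = fit_probe(faithful, unfaithful)

# Training accuracy of the probe on the synthetic data.
correct = sum(not is_unfaithful(h, direction, threshold) for h in faithful)
correct += sum(is_unfaithful(h, direction, threshold) for h in unfaithful)
accuracy = correct / (len(faithful) + len(unfaithful))
```

Because `direction` has unit length, `steer` lowers a vector's projection onto it by exactly `alpha`, so a sufficiently large `alpha` moves a flagged representation to the other side of the probe's threshold. Whether such a shift changes the faithfulness of an actual model's explanations is an empirical question this toy example does not address.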
