AI LG Feb 2

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

Harry Mayne, Justin Singh Kang, Dewi Gould, Kannan Ramchandran, Adam Mahdi, Noah Y. Siegel

arXiv:2602.02639v1 8.83 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for better AI oversight tools by providing a scalable method for evaluating explanation faithfulness. The advance is incremental because it builds on existing faithfulness metrics.

The paper tackles the problem of assessing how faithful LLM self-explanations are to the model's reasoning. It introduces Normalized Simulatability Gain (NSG), a metric that measures how much an explanation helps an observer predict model behavior on related inputs. Across 7,000 counterfactuals from health, business, and ethics datasets, self-explanations improve prediction of model behavior by 11-37% NSG.

LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood. Existing faithfulness metrics have critical limitations, typically relying on identifying unfaithfulness via adversarial prompting or detecting reasoning errors. These methods overlook the predictive value of explanations. We introduce Normalized Simulatability Gain (NSG), a general and scalable metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria, and thus better predict its behavior on related inputs. We evaluate 18 frontier proprietary and open-weight models, e.g., Gemini 3, GPT-5.2, and Claude 4.5, on 7,000 counterfactuals from popular datasets covering health, business, and ethics. We find self-explanations substantially improve prediction of model behavior (11-37% NSG). Self-explanations also provide more predictive information than explanations generated by external models, even when those models are stronger. This implies an advantage from self-knowledge that external explanation methods cannot replicate. Our approach also reveals that, across models, 5-15% of self-explanations are egregiously misleading. Despite their imperfections, we show a positive case for self-explanations: they encode information that helps predict model behavior.
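The abstract describes NSG only in words, so the following is a minimal sketch of how a simulatability-gain score of this kind could be computed, not the paper's definition. It assumes the gain is the observer's accuracy improvement from seeing the explanation, normalized by the headroom left above the no-explanation baseline. The function names, the normalization, and the toy answers are all hypothetical.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of counterfactual inputs where the observer guessed the model's answer."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def normalized_gain(acc_without: float, acc_with: float) -> float:
    """ASSUMED normalization: share of the remaining headroom closed by the explanation.

    This is an illustrative choice; the source does not state the formula.
    """
    headroom = 1.0 - acc_without
    if headroom <= 0.0:
        raise ValueError("baseline is already perfect; gain is undefined")
    return (acc_with - acc_without) / headroom


# Toy data, invented for illustration only (not results from the paper).
# What the explained model actually answered on ten counterfactual inputs:
model_answers = ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "no"]
# An observer's guesses without access to the model's explanation:
guess_without_expl = ["yes", "no", "yes", "yes", "no", "no", "no", "yes", "no", "yes"]
# The same observer's guesses after reading the explanation:
guess_with_expl = ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "no", "yes"]

baseline = accuracy(guess_without_expl, model_answers)
informed = accuracy(guess_with_expl, model_answers)
print(f"baseline={baseline:.2f} informed={informed:.2f} "
      f"gain={normalized_gain(baseline, informed):.2f}")
```

In this toy case the observer goes from 6 to 8 correct guesses out of 10, so the explanation closes about half of the remaining gap. Under the same assumed formula, a negative value would mean the explanation made the observer's predictions worse than having no explanation at all.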
