Fixed Point Explainability
This work addresses the need for more reliable and stable explanations in interpretable AI, particularly for complex models such as LLMs, though the contribution appears incremental because it builds on existing explainability principles.
The paper tackles the problem of evaluating explanation stability in machine learning models. It introduces fixed point explanations, which assess the interplay between a model and its explainer through recursive application, and reports quantitative and qualitative results across several datasets and models, including LLMs such as Llama-3.3-70B.
This paper introduces a formal notion of fixed point explanations, inspired by the "why regress" principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results for several datasets and models, including LLMs such as Llama-3.3-70B.
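The idea of recursive application can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not specify the procedure, so the example assumes a feature-based explainer that maps an input and a set of active features to a subset of those features, and it re-applies that explainer until the output stops changing. The names `iterate_to_fixed_point` and `make_toy_explainer`, the linear toy model, and the "keep features with at least mean contribution" rule are all hypothetical choices made for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# An explanation is modelled here (an assumption) as a set of feature indices.
Explanation = FrozenSet[int]
# Hypothetical explainer signature: (input, currently active features) -> kept features.
Explainer = Callable[[Sequence[float], Explanation], Explanation]


def iterate_to_fixed_point(
    explain: Explainer, x: Sequence[float], max_iters: int = 50
) -> Tuple[Explanation, List[Explanation], bool]:
    """Re-apply `explain` to its own output until the explanation stops changing."""
    current: Explanation = frozenset(range(len(x)))  # start from all features
    history: List[Explanation] = [current]
    for _ in range(max_iters):
        nxt = explain(x, current)
        if nxt == current:  # explaining the explanation changes nothing
            return current, history, True
        history.append(nxt)
        current = nxt
    return current, history, False  # no fixed point found within the budget


def make_toy_explainer(weights: Sequence[float]) -> Explainer:
    """Toy explainer for a linear model: keep active features whose absolute
    contribution |w_i * x_i| is at least the mean over the active features."""

    def explain(x: Sequence[float], active: Explanation) -> Explanation:
        contrib = {i: abs(weights[i] * x[i]) for i in active}
        if not contrib:
            return frozenset()
        mean = sum(contrib.values()) / len(contrib)
        return frozenset(i for i, c in contrib.items() if c >= mean)

    return explain


weights = [2.0, 1.5, 1.0, 0.5]
x = [2.0, 2.0, 1.0, 1.0]  # absolute contributions: 4.0, 3.0, 1.0, 0.5
explanation, history, converged = iterate_to_fixed_point(make_toy_explainer(weights), x)
print(sorted(explanation), converged)
for step, feats in enumerate(history):
    print(step, sorted(feats))
```

In this toy setting the explanation shrinks at each step until re-explaining it returns the same set. A one-shot explanation would stop after the first step. Whether an iteration like this converges at all depends on the explainer, which is the kind of condition the abstract says the paper characterises for different explainer classes.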