EXAGREE: Mitigating Explanation Disagreement with Stakeholder-Aligned Models
This work addresses explanation disagreement for stakeholders in safety-critical domains. It is an incremental improvement that integrates existing concepts into a novel framework.
The paper tackles conflicting explanations in machine learning models for safety-critical domains by introducing EXAGREE, a framework that selects stakeholder-aligned explanation models. Across six real-world datasets, it reports simultaneous gains in faithfulness, plausibility, and fairness while preserving task accuracy.
Conflicting explanations, arising from different attribution methods or model internals, limit the adoption of machine learning models in safety-critical domains. We turn this disagreement into an advantage and introduce EXplanation AGREEment (EXAGREE), a two-stage framework that selects a Stakeholder-Aligned Explanation Model (SAEM) from a set of similar-performing models. The selection maximizes Stakeholder-Machine Agreement (SMA), a single metric that unifies faithfulness and plausibility. EXAGREE couples a differentiable mask-based attribution network (DMAN) with monotone differentiable sorting, enabling gradient-based search inside the constrained model space. Experiments on six real-world datasets demonstrate simultaneous gains in faithfulness, plausibility, and fairness over baselines, while preserving task accuracy. Extensive ablation studies, significance tests, and case studies confirm the robustness and feasibility of the method in practice.
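To make the idea of ranking-based agreement concrete, the following is a minimal illustrative sketch, not the method from the paper. It assumes that agreement is measured as a Spearman-style correlation between a stakeholder's feature ranking and a smooth ranking of model attributions. The pairwise-sigmoid soft rank used here is a simpler stand-in for the monotone differentiable sorting named above, and the names `soft_rank`, `rank_agreement`, `temperature`, and the example values are all hypothetical.

```python
import numpy as np


def soft_rank(scores, temperature=0.1):
    """Smooth approximation of descending ranks (rank 1 = largest score).

    Assumption: a pairwise-sigmoid relaxation, used only as a stand-in
    for the differentiable sorting described in the abstract.
    """
    scores = np.asarray(scores, dtype=float)
    # diff[i, j] = scores[j] - scores[i]; positive when item j outranks item i
    diff = scores[None, :] - scores[:, None]
    # sigmoid(diff / T) softly counts how many items outrank item i
    pairwise = 1.0 / (1.0 + np.exp(-diff / temperature))
    # the diagonal contributes sigmoid(0) = 0.5, so remove it and start ranks at 1
    return 1.0 + pairwise.sum(axis=1) - 0.5


def rank_agreement(rank_a, rank_b):
    """Spearman-style agreement: Pearson correlation of two rank vectors."""
    a = np.asarray(rank_a, dtype=float)
    b = np.asarray(rank_b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical attribution scores for four features
attributions = np.array([0.9, 0.1, 0.5, 0.3])
# Hypothetical stakeholder ranking of the same features (1 = most important)
stakeholder_rank = np.array([1, 4, 2, 3])

model_rank = soft_rank(attributions, temperature=0.01)
agreement = rank_agreement(model_rank, stakeholder_rank)
print(model_rank, agreement)
```

Because `soft_rank` is built from smooth operations, the same expression written in an automatic differentiation framework would admit gradients with respect to the attribution scores, which is the property a gradient-based search over models would rely on. Lower temperatures bring the soft ranks closer to hard ranks at the cost of sharper gradients.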