LG · Dec 16, 2022

Robust Explanation Constraints for Neural Networks

Matthew Wicker, Juyeon Heo, Luca Costabello, Adrian Weller

Cambridge

arXiv:2212.08507v118.528 citations · h-index: 49 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the unreliability of neural network explanations, a problem for users who rely on those explanations for trust and insight. It represents an incremental improvement over prior heuristic methods.

The paper tackles the fragility of post-hoc explanation methods for neural networks. It develops a method that formally certifies the robustness of gradient-based explanations against bounded adversarial manipulation of either the inputs or the model parameters. Empirically, the method surpasses previous heuristic approaches, and its training procedure is the only one able to learn networks with certificates of explanation robustness across all six datasets tested.

Post-hoc explanation methods are used with the intent of providing insights about neural networks and are sometimes said to help engender trust in their outputs. However, popular explanation methods have been found to be fragile to minor perturbations of input features or model parameters. Relying on constraint relaxation techniques from non-convex optimization, we develop a method that upper-bounds the largest change an adversary can make to a gradient-based explanation via bounded manipulation of either the input features or model parameters. By propagating a compact input or parameter set as symbolic intervals through the forwards and backwards computations of the neural network, we can formally certify the robustness of gradient-based explanations. Our bounds are differentiable, hence we can incorporate provable explanation robustness into neural network training. Empirically, our method surpasses the robustness provided by previous heuristic approaches. We find that our training method is the only method able to learn neural networks with certificates of explanation robustness across all six datasets tested.
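
The idea of propagating an input set through the forward and backward computations can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain interval arithmetic on a one-hidden-layer ReLU network f(x) = w2 · relu(W1 x + b1), handles input perturbations only, and every name in it (`interval_matvec`, `input_gradient`, `gradient_bounds`, the toy weights) is hypothetical. The result is a box guaranteed to contain the input gradient for every x within an L-infinity ball of radius `eps` around `x0`.

```python
def interval_matvec(W, lo, hi, b=None):
    """Bound W @ v + b elementwise for any v with lo <= v <= hi (W, b fixed)."""
    out_lo, out_hi = [], []
    for i, row in enumerate(W):
        bias = b[i] if b is not None else 0.0
        # A non-negative weight is minimised at lo[j], a negative one at hi[j].
        out_lo.append(bias + sum(w * (lo[j] if w >= 0 else hi[j])
                                 for j, w in enumerate(row)))
        out_hi.append(bias + sum(w * (hi[j] if w >= 0 else lo[j])
                                 for j, w in enumerate(row)))
    return out_lo, out_hi


def input_gradient(W1, b1, w2, x):
    """Exact gradient of f(x) = w2 . relu(W1 x + b1) with respect to x."""
    z = [sum(w * xj for w, xj in zip(row, x)) + b for row, b in zip(W1, b1)]
    g_hidden = [w2[i] if z[i] > 0 else 0.0 for i in range(len(z))]
    return [sum(W1[i][j] * g_hidden[i] for i in range(len(W1)))
            for j in range(len(x))]


def gradient_bounds(W1, b1, w2, x0, eps):
    """Interval bounds on the input gradient over the box |x - x0|_inf <= eps."""
    x_lo = [v - eps for v in x0]
    x_hi = [v + eps for v in x0]
    # Forward pass: bound the pre-activations z = W1 x + b1.
    z_lo, z_hi = interval_matvec(W1, x_lo, x_hi, b1)
    # The ReLU derivative 1[z > 0] is monotone in z, so it lies in
    # [1[z_lo > 0], 1[z_hi > 0]] for every z in [z_lo, z_hi].
    d_lo = [1.0 if z > 0 else 0.0 for z in z_lo]
    d_hi = [1.0 if z > 0 else 0.0 for z in z_hi]
    # Backward pass: bound w2 * d, then multiply by the transpose of W1.
    g_lo = [min(w * a, w * b) for w, a, b in zip(w2, d_lo, d_hi)]
    g_hi = [max(w * a, w * b) for w, a, b in zip(w2, d_lo, d_hi)]
    W1_T = [list(col) for col in zip(*W1)]
    return interval_matvec(W1_T, g_lo, g_hi)


# Toy network and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.3]]
b1 = [0.1, -0.2, 0.05]
w2 = [1.0, -1.0, 2.0]
x0 = [0.2, 0.1]
eps = 0.05

grad_lo, grad_hi = gradient_bounds(W1, b1, w2, x0, eps)
print("gradient at x0:", input_gradient(W1, b1, w2, x0))
print("certified lower bound:", grad_lo)
print("certified upper bound:", grad_hi)
```

The width of the box `grad_hi - grad_lo` plays the role of a robustness certificate: the narrower it is, the less an adversary inside the ball can change the gradient explanation. Because bounds of this kind are built from sums, minima, and maxima, they can in principle be written in an autodiff framework and penalised during training, which is the property the abstract relies on.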
