Robust Models Are More Interpretable Because Attributions Look Normal
This work addresses the challenge of making deep learning models more interpretable for researchers and practitioners, offering a novel method to improve attribution quality, though it is incremental in building on existing robust model insights.
The paper tackles the problem of improving interpretability in deep learning models by showing that adversarially-robust models have smoother decision boundaries, leading to sharper and more concentrated feature attributions. It introduces boundary attributions, a method that aggregates normal vectors of local decision boundaries to produce enhanced visual explanations, even on non-robust models.
Recent work has found that adversarially-robust deep networks used for image classification are more interpretable: their feature attributions tend to be sharper, and are more concentrated on the objects associated with the image's ground-truth class. We show that smooth decision boundaries play an important role in this enhanced interpretability, as the model's input gradients around data points will more closely align with boundaries' normal vectors when they are smooth. Thus, because robust models have smoother boundaries, the results of gradient-based attribution methods, like Integrated Gradients and DeepLift, will capture more accurate information about nearby decision boundaries. This understanding of robust interpretability leads to our second contribution: \emph{boundary attributions}, which aggregate information about the normal vectors of local decision boundaries to explain a classification outcome. We show that by leveraging the key factors underpinning robust interpretability, boundary attributions produce sharper, more concentrated visual explanations -- even on non-robust models. Any example implementation can be found at \url{https://github.com/zifanw/boundary}.