LG · Jan 31, 2023

Interpreting Robustness Proofs of Deep Neural Networks

arXiv:2301.13845v1 · 6 citations · h-index: 8
AI Analysis

This paper addresses the interpretability gap in formal verification methods for deep neural networks. It is an incremental improvement, relevant to researchers and practitioners in AI safety and verification.

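The paper tackles the problem of interpreting formal robustness proofs for deep neural networks (DNNs). It shows that the robustness proofs of standard DNNs rely on spurious input features, while the proofs of provably robust DNNs filter out even semantically meaningful features. Proofs for DNNs trained with a combination of adversarial and provably robust training are the most effective at filtering out spurious features while still relying on human-understandable ones.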

In recent years, numerous methods have been developed to formally verify the robustness of deep neural networks (DNNs). Though the proposed techniques are effective in providing mathematical guarantees about the DNNs' behavior, it is not clear whether the proofs generated by these methods are human-interpretable. In this paper, we bridge this gap by developing new concepts, algorithms, and representations to generate human-understandable interpretations of the proofs. Leveraging the proposed method, we show that the robustness proofs of standard DNNs rely on spurious input features, while the proofs of DNNs trained to be provably robust filter out even the semantically meaningful features. The proofs for the DNNs combining adversarial and provably robust training are the most effective at selectively filtering out spurious features as well as relying on human-understandable input features.

