Towards White Box Deep Learning
This work addresses interpretability and robustness in deep learning, both of which matter for deploying AI in safety-critical domains, though it appears to be a proof of concept rather than a broad state-of-the-art advance.
The paper tackles the problem of deep neural networks being black boxes and vulnerable to adversarial attacks by proposing semantic features as an architectural solution, resulting in a lightweight, interpretable network that achieves near-human-level adversarial test metrics without adversarial training.
Deep neural networks learn fragile "shortcut" features, which makes them difficult to interpret (black box) and vulnerable to adversarial attacks. This paper proposes semantic features as a general architectural solution to this problem. The main idea is to make features locality-sensitive in the adequate semantic topology of the domain, thus introducing a strong regularization. The proof-of-concept network is lightweight, inherently interpretable, and achieves almost human-level adversarial test metrics with no adversarial training. These results and the general nature of the approach warrant further research on semantic features. The code is available at https://github.com/314-Foundation/white-box-nn
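To make the phrase "locality-sensitive in the semantic topology" more concrete, here is a minimal toy sketch. It is not taken from the paper or its repository: the choice of small pixel shifts as the "semantic neighbourhood", the use of cosine similarity, and the names `shift`, `cosine`, and `semantic_feature` are all assumptions made purely for illustration. The feature responds strongly only to inputs that are close to a base pattern under those small transformations.

```python
from itertools import product

def shift(img, dy, dx):
    """Translate a 2D list by (dy, dx), filling vacated cells with 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y, x in product(range(h), range(w)):
        sy, sx = y - dy, x - dx
        if 0 <= sy < h and 0 <= sx < w:
            out[y][x] = img[sy][sx]
    return out

def cosine(a, b):
    """Cosine similarity between two equally sized 2D lists."""
    dot = sum(p * q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    na = sum(p * p for r in a for p in r) ** 0.5
    nb = sum(q * q for r in b for q in r) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def semantic_feature(x, template, max_shift=1):
    """Illustrative feature: best match between x and the template over
    a small set of shifts (an assumed stand-in for a semantic neighbourhood)."""
    shifts = range(-max_shift, max_shift + 1)
    return max(cosine(x, shift(template, dy, dx))
               for dy, dx in product(shifts, shifts))

# Base pattern: a vertical bar in the middle column of a 5x5 grid.
template = [[1.0 if x == 2 else 0.0 for x in range(5)] for _ in range(5)]
# Semantically close input: the same bar moved one pixel to the right.
near = shift(template, 0, 1)
# Semantically distant input: a horizontal bar.
far = [[1.0 if y == 2 else 0.0 for _ in range(5)] for y in range(5)]

print(cosine(near, template))            # plain template match
print(semantic_feature(near, template))  # match within the shift neighbourhood
print(semantic_feature(far, template))   # distant pattern
```

In this toy setup the shifted bar has no overlap with the unshifted template, so a plain template match misses it, while the shift-aware feature recognises it and still gives a low response to the horizontal bar. The semantic features described in the paper may be defined quite differently; see the linked repository for the actual implementation.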