LG, CR | Feb 20, 2025

Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability

arXiv:2502.15017v1 | 1 citation | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work provides insights into adversarial robustness for deep learning practitioners, but it is incremental as it applies an existing interpretable architecture to analyze known defense methods.

The paper tackles the problem of interpreting adversarial attacks and defenses by using Deep Linearly Gated Networks (DLGN) to compare robust models trained with PGD adversarial training (PGD-AT) against models trained with standard training. It finds that PGD-AT hyperplanes lie farther from the data points and that PGD-AT models form diverse, non-overlapping active subnetworks across classes.
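To make the "farther from data points" claim concrete, here is a minimal sketch of one way to measure how far a hyperplane lies from a set of points. The summary does not state the paper's exact metric, so the mean Euclidean distance used here, the function names, and the toy values are all assumptions for illustration.

```python
import math

def hyperplane_distance(w, b, x):
    """Euclidean distance from point x to the hyperplane {z : w.z + b = 0}."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def mean_distance(w, b, points):
    """Average distance of a set of points to one hyperplane (assumed metric)."""
    return sum(hyperplane_distance(w, b, x) for x in points) / len(points)

# Toy data: two points and two candidate hyperplanes (values are made up).
points = [(1.0, 0.0), (3.0, 0.0)]
near = mean_distance((1.0, 0.0), -1.5, points)  # hyperplane x = 1.5, between the points
far = mean_distance((1.0, 0.0), 2.0, points)    # hyperplane x = -2, away from the points
print(near, far)
```

Under this measure, a perturbation of a given size is less likely to push a point across the second hyperplane than across the first, which is one intuition for why larger distances could go together with robustness.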

Adversarial attacks pose a significant threat to the integrity and reliability of deep learning models, and adversarial training is a popular defence against them. In this work, we capitalize on a network architecture, Deep Linearly Gated Networks (DLGN), which is more interpretable than regular deep network architectures. Using this architecture, we interpret robust models trained with PGD adversarial training (PGD-AT) and compare them with models obtained by standard training (STD-TR). Feature networks in DLGN act as feature extractors, making them the only medium through which an adversary can attack the model. We analyze the feature network of a DLGN with fully connected layers with respect to properties such as the alignment of the hyperplanes, the relation of the hyperplanes to PCA, and sub-network overlap among classes, and compare these properties between robust and standard models. We also consider a variant of this architecture with CNN layers, for which we contrast gating patterns between robust and standard models both qualitatively (using visualizations) and quantitatively. We find that hyperplanes resemble principal components in both PGD-AT and STD-TR models, with PGD-AT hyperplanes aligned farther from the data points. Path activity analysis shows that PGD-AT models create diverse, non-overlapping active subnetworks across classes, preventing attack-induced gating overlaps. Our visualizations illustrate the nature of the representations learnt by PGD-AT and STD-TR models.
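The abstract's description of the feature network and of sub-network overlap can be illustrated with a small sketch. It assumes a hard-gated, fully connected DLGN in which the feature path is purely linear in the input and the value path receives a constant all-ones input. The abstract does not spell out these details, so they are assumptions here. The Jaccard overlap of active gates is only a simple stand-in for the paper's path activity analysis, and all names and weights are hypothetical.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def dlgn_forward(x, feature_weights, value_weights, readout):
    """Return (output, gates) for a hard-gated DLGN-style network (assumed form)."""
    f = list(x)                           # feature path: stays linear in x
    h = [1.0] * len(value_weights[0][0])  # value path: constant input (assumption)
    gates = []
    for Wf, Wv in zip(feature_weights, value_weights):
        f = matvec(Wf, f)                           # no nonlinearity on the feature path
        g = [1.0 if z > 0 else 0.0 for z in f]      # each gate is a hyperplane test on x
        h = [gi * vi for gi, vi in zip(g, matvec(Wv, h))]  # gates modulate the value path
        gates.append(g)
    out = sum(r * hi for r, hi in zip(readout, h))
    return out, gates

def gate_overlap(gates_a, gates_b):
    """Jaccard overlap of active gates (stand-in for path activity overlap)."""
    a = [g for layer in gates_a for g in layer]
    b = [g for layer in gates_b for g in layer]
    both = sum(1 for p, q in zip(a, b) if p and q)
    either = sum(1 for p, q in zip(a, b) if p or q)
    return both / either if either else 0.0

# Toy two-layer network with made-up weights.
feature_weights = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, -1.0], [1.0, 1.0]]]
value_weights = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
readout = [1.0, 1.0]

out_a, gates_a = dlgn_forward((2.0, 1.0), feature_weights, value_weights, readout)
out_b, gates_b = dlgn_forward((-1.0, 2.0), feature_weights, value_weights, readout)
overlap = gate_overlap(gates_a, gates_b)
print(out_a, out_b, overlap)
```

Because the input enters only through the gates in this sketch, an input perturbation can change the output only by flipping gates. A low overlap between the gate patterns of different classes therefore means an attack has to flip more gates to move an input from one class's subnetwork to another's.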
