LG MLSep 10, 2019

Learning to Disentangle Robust and Vulnerable Features for Adversarial Detection

Byunggill Joe, Sung Ju Hwang, Insik Shin

arXiv:1909.04311v16.05 citations

Originality Incremental advance

AI Analysis

This addresses the challenge of effective adversarial defense in deep neural networks, particularly against whitebox attacks, though it is incremental as it builds on existing detection and robust inference approaches.

The paper tackled the problem of defending against whitebox adversarial attacks by hypothesizing that adversarial inputs are tied to vulnerable latent features, and proposed a method to disentangle robust and vulnerable features using variational autoencoders, resulting in a detector that adversarial inputs cannot bypass without changing semantics.

Although deep neural networks have shown promising performances on various tasks, even achieving human-level performance on some, they are shown to be susceptible to incorrect predictions even with imperceptibly small perturbations to an input. There exists a large number of previous works which proposed to defend against such adversarial attacks either by robust inference or detection of adversarial inputs. Yet, most of them cannot effectively defend against whitebox attacks where an adversary has a knowledge of the model and defense. More importantly, they do not provide a convincing reason why the generated adversarial inputs successfully fool the target models. To address these shortcomings of the existing approaches, we hypothesize that the adversarial inputs are tied to latent features that are susceptible to adversarial perturbation, which we call vulnerable features. Then based on this intuition, we propose a minimax game formulation to disentangle the latent features of each instance into robust and vulnerable ones, using variational autoencoders with two latent spaces. We thoroughly validate our model for both blackbox and whitebox attacks on MNIST, Fashion MNIST5, and Cat & Dog datasets, whose results show that the adversarial inputs cannot bypass our detector without changing its semantics, in which case the attack has failed.

View on arXiv PDF

Similar