CL | CR | LG | Oct 16, 2020

Mischief: A Simple Black-Box Attack Against Transformer Architectures

arXiv:2010.08542v1
Originality: Incremental advance
AI Analysis

This paper addresses the vulnerability of transformer architectures to adversarial attacks. The advance is incremental, but the security concern matters for NLP applications.

The paper tackles adversarial attacks on transformer-based language models by introducing Mischief, a simple black-box method that generates human-readable adversarial examples. When present in the test set, these examples degrade model performance by up to 20%. Including similar examples in the training set restores the baseline scores on the adversarial test set and, for certain tasks, modestly improves on the original, non-adversarial baseline.

We introduce Mischief, a simple and lightweight method to produce a class of human-readable, realistic adversarial examples for language models. We perform exhaustive experiments with our algorithm on four transformer-based architectures, across a variety of downstream tasks, as well as under varying concentrations of said examples. Our findings show that the presence of Mischief-generated adversarial samples in the test set significantly degrades (by up to $20\%$) the performance of these models with respect to their reported baselines. Nonetheless, we also demonstrate that, by including similar examples in the training set, it is possible to restore the baseline scores on the adversarial test set. Moreover, for certain tasks, the models trained with the Mischief set show a modest increase in performance with respect to their original, non-adversarial baseline.
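The abstract mentions evaluating "under varying concentrations" of adversarial examples but does not spell out the perturbation itself. The sketch below illustrates only the mixing step. The `perturb` function is a hypothetical stand-in (it shuffles the interior letters of each word, keeping the first and last letters) and should not be read as the paper's algorithm. The names `perturb`, `mix_adversarial`, `concentration`, and `seed` are all assumptions introduced for illustration.

```python
import random


def perturb(sentence: str, rng: random.Random) -> str:
    """Hypothetical stand-in perturbation (NOT confirmed as the paper's method).

    Shuffles the interior letters of every purely alphabetic word longer than
    three characters, leaving the first and last letters in place.
    """
    words = []
    for word in sentence.split():
        if len(word) > 3 and word.isalpha():
            middle = list(word[1:-1])
            rng.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)
    return " ".join(words)


def mix_adversarial(sentences, concentration, seed=0):
    """Return a copy of `sentences` in which a fraction `concentration`
    of the entries has been replaced by perturbed versions."""
    if not 0.0 <= concentration <= 1.0:
        raise ValueError("concentration must lie in [0, 1]")
    rng = random.Random(seed)
    n_adv = round(concentration * len(sentences))
    # Pick which positions to perturb, without replacement.
    chosen = set(rng.sample(range(len(sentences)), n_adv))
    return [
        perturb(s, rng) if i in chosen else s
        for i, s in enumerate(sentences)
    ]


if __name__ == "__main__":
    data = [
        "transformers are vulnerable to simple attacks",
        "the quick brown fox jumps over the lazy dog",
    ]
    for line in mix_adversarial(data, concentration=0.5, seed=1):
        print(line)
```

The same helper could be applied to a training set as well as a test set, which mirrors the abstract's two settings: adversarial samples only at test time, and adversarial samples included during training.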

