CL | AI | Feb 3, 2023

TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification

arXiv:2302.02023v1 | 11 citations | h-index: 13
Originality: Highly original
AI Analysis

This paper addresses a major challenge for deploying neural networks in safety-critical NLP applications: it extends detection-based defense to include correction, filling a gap in existing methods.

The paper tackles adversarial attacks in text classification by proposing TextShield, which detects and corrects adversarial sentences and achieves performance higher than or comparable to state-of-the-art defenses across various attacks and benchmarks.

Adversarial attacks pose a major challenge for neural network models in NLP and preclude their deployment in safety-critical applications. A recent line of work, detection-based defense, aims to distinguish adversarial sentences from benign ones. However, the core limitation of previous detection methods is that, unlike defense methods from other paradigms, they cannot give correct predictions on adversarial sentences. To solve this issue, this paper proposes TextShield: (1) we discover a link between text attacks and saliency information, and propose a saliency-based detector that can effectively detect whether an input sentence is adversarial; (2) we design a saliency-based corrector, which converts the detected adversarial sentences to benign ones. By combining the saliency-based detector and corrector, TextShield extends the detection-only paradigm to a detection-correction paradigm, thus filling the gap in existing detection-based defenses. Comprehensive experiments show that (a) TextShield consistently achieves performance higher than or comparable to state-of-the-art defense methods across various attacks on different benchmarks, and (b) our saliency-based detector outperforms existing detectors at detecting adversarial sentences.
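
To make the detection-correction idea concrete, here is a minimal Python sketch of a detect-then-correct wrapper around a classifier. It is not TextShield's implementation: the abstract does not specify how the detector or corrector work, so everything here is an assumption made for illustration. The toy lexicon `WEIGHTS`, the occlusion (leave-one-out) saliency, the "prediction flips when the most salient token is removed" detection rule, and the "drop that token" correction are all hypothetical stand-ins.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon standing in for a trained classifier (assumption).
WEIGHTS: Dict[str, float] = {"good": 1.0, "nice": 1.0, "dreadful": -3.0}


def score(tokens: List[str]) -> float:
    """Sum of per-token weights; unknown tokens contribute 0."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def predict(tokens: List[str]) -> int:
    """Return 1 (positive) if the score is strictly positive, else 0."""
    return 1 if score(tokens) > 0 else 0


def saliency(tokens: List[str]) -> List[float]:
    """Occlusion saliency: how much the score moves when a token is removed."""
    base = score(tokens)
    return [abs(base - score(tokens[:i] + tokens[i + 1:]))
            for i in range(len(tokens))]


def most_salient(tokens: List[str]) -> int:
    """Index of the token with the largest saliency (first one on ties)."""
    sal = saliency(tokens)
    return max(range(len(tokens)), key=sal.__getitem__)


def detect(tokens: List[str]) -> bool:
    """Toy detector (assumption): flag the input if removing its single most
    salient token flips the predicted label."""
    if not tokens:
        return False
    i = most_salient(tokens)
    return predict(tokens[:i] + tokens[i + 1:]) != predict(tokens)


def correct(tokens: List[str]) -> List[str]:
    """Toy corrector (assumption): drop the most salient token."""
    i = most_salient(tokens)
    return tokens[:i] + tokens[i + 1:]


def defend(tokens: List[str]) -> Tuple[int, bool]:
    """Detection-correction pipeline: correct only flagged inputs, then predict."""
    flagged = detect(tokens)
    cleaned = correct(tokens) if flagged else tokens
    return predict(cleaned), flagged


print(defend("good nice film".split()))           # (1, False)
print(defend("good nice dreadful film".split()))  # (1, True)
print(defend("dreadful film".split()))            # (0, False)
```

The point of the sketch is the control flow: a detection-only defense would stop at `detect`, whereas the detection-correction paradigm described above still returns a label for flagged inputs. The toy detection rule can misfire on benign sentences whose label genuinely hinges on one word, so it should not be read as a usable detector.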

