VoteTRANS: Detecting Adversarial Text without Training by Voting on Hard Labels of Transformations
This work addresses the vulnerability of deep learning text models to adversarial attacks by offering a training-free detection method.
The paper tackles the detection of adversarial text attacks that preserve meaning and evade human recognition. It proposes VoteTRANS, a method that requires no training and instead votes on the hard labels predicted for transformations of the input. The evaluation shows effective detection across various attacks, models, and datasets.
Adversarial attacks reveal serious flaws in deep learning models. More dangerously, these attacks preserve the original meaning and escape human recognition. Existing methods for detecting these attacks must be trained on original and adversarial data. In this paper, we propose detection without training by voting on hard labels from predictions of transformations, namely, VoteTRANS. Specifically, VoteTRANS detects adversarial text by comparing the hard label of the input text with the hard labels of its transformations. The evaluation demonstrates that VoteTRANS effectively detects adversarial text across various state-of-the-art attacks, models, and datasets.
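
The idea of comparing hard labels can be illustrated with a minimal sketch. This is not the authors' implementation: the function `detect_by_voting`, the toy keyword classifier `toy_predict`, the synonym-swap generator `toy_transform`, and the simple "more disagreeing than agreeing votes" rule are all assumptions made for illustration. A real setup would plug in a trained model and the word transformations used by an attack.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical toy vocabulary; a real detector would query a trained model.
POSITIVE = {"great", "good"}
NEGATIVE = {"terrible", "awful"}
SYNONYMS = {"dreadful": ["terrible", "awful"], "great": ["good", "fine"]}


def toy_predict(text: str) -> str:
    """Return a hard label ("pos" or "neg") from keyword counts."""
    words = text.split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "neg" if score < 0 else "pos"


def toy_transform(text: str) -> Iterator[str]:
    """Yield variants of the text with one word replaced by a synonym."""
    words = text.split()
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            yield " ".join(words[:i] + [synonym] + words[i + 1:])


def detect_by_voting(
    text: str,
    predict: Callable[[str], str],
    transform: Callable[[str], Iterable[str]],
) -> bool:
    """Flag the text as adversarial when more transformations disagree
    with the input's hard label than agree with it (assumed voting rule)."""
    input_label = predict(text)
    agree = disagree = 0
    for variant in transform(text):
        if predict(variant) == input_label:
            agree += 1
        else:
            disagree += 1
    return disagree > agree


# "dreadful" is unknown to the toy classifier, so the input is labeled "pos",
# while both synonym variants are labeled "neg": the vote flags the input.
flag_suspicious = detect_by_voting("the movie was dreadful", toy_predict, toy_transform)
# A clean positive sentence keeps its label under every transformation.
flag_clean = detect_by_voting("the movie was great", toy_predict, toy_transform)
```

Only hard labels are used in this sketch, so it needs no access to model probabilities and no detector training; the choice of transformations and of the voting threshold is left open here.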