Mutation-Based Adversarial Attacks on Neural Text Detectors
This work addresses the robustness of neural text detectors against adversarial attacks, an incremental step toward more secure AI systems.
The paper challenges neural text detectors by proposing character- and word-based mutation operators that generate adversarial samples, which reduce the prediction accuracy of state-of-the-art detectors.
Neural text detectors aim to identify the characteristics that distinguish neural (machine-generated) text from human-written text. To challenge such detectors, adversarial attacks can alter the statistical characteristics of the generated text, making the detection task increasingly difficult. Inspired by advances in mutation analysis in software development and testing, in this paper we propose character- and word-based mutation operators for generating adversarial samples to attack state-of-the-art neural text detectors. This falls under white-box adversarial attacks, in which attackers have access to the original text and create mutation instances based on it. The ultimate goal is to confuse machine learning models and classifiers and decrease their prediction accuracy.
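To make the idea of mutation operators concrete, the following is a minimal sketch of one character-based and one word-based operator applied to an original text. It is an illustration under stated assumptions, not the operators defined in the paper: the homoglyph table, the misspelling table, and the function names `mutate_chars` and `mutate_words` are all hypothetical.

```python
import random

# Hypothetical homoglyph table (an assumption for illustration):
# Latin letters mapped to visually similar Cyrillic letters.
HOMOGLYPHS = {
    "a": "\u0430",
    "e": "\u0435",
    "o": "\u043e",
    "c": "\u0441",
    "p": "\u0440",
}

# Hypothetical word-level substitutions (common misspellings),
# chosen only to illustrate a word-based operator.
MISSPELLINGS = {
    "receive": "recieve",
    "definitely": "definately",
    "separate": "seperate",
}


def mutate_chars(text, rate, rng):
    """Character-based operator: swap each eligible character for its
    homoglyph with probability `rate`."""
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])
        else:
            out.append(ch)
    return "".join(out)


def mutate_words(text, rate, rng):
    """Word-based operator: swap each listed word for its misspelling
    with probability `rate`. Only exact lowercase tokens separated by
    single spaces are matched; punctuation handling is omitted."""
    out = []
    for word in text.split(" "):
        if word in MISSPELLINGS and rng.random() < rate:
            out.append(MISSPELLINGS[word])
        else:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the mutants are reproducible
    original = "we definitely receive a separate copy"
    print(mutate_chars(original, 0.5, rng))
    print(mutate_words(original, 1.0, rng))
```

Each call produces one mutation instance derived from the original text. In an evaluation along the lines described above, such instances would be passed to a detector and its accuracy compared with its accuracy on the unmutated text. The `rate` parameter is an assumed knob controlling how heavily the text is perturbed.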