CL | Jun 24, 2024

Investigating the Influence of Prompt-Specific Shortcuts in AI Generated Text Detection

arXiv:2406.16275v1 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a critical generalization problem in AIGT detection that matters for AI safety and content moderation, but the advance is incremental because it builds on existing adversarial attack methods.

The paper investigates how limited prompt variation in AI Generated Text (AIGT) detection datasets introduces prompt-specific shortcuts that harm generalization. It proposes FAILOpt, an attack that exploits these shortcuts and lowers detection performance by amounts comparable to other attacks. The same method is then used to augment the training data, which improves robustness across generation models, tasks, and attacks.

AI Generated Text (AIGT) detectors are developed with human-written and LLM-generated texts for common tasks. Despite the diversity of plausible prompt choices, these datasets are generally constructed with a limited number of prompts. The lack of prompt variation can introduce prompt-specific shortcut features that exist in data collected with the chosen prompt, but do not generalize to others. In this paper, we analyze the impact of such shortcuts in AIGT detection. We propose Feedback-based Adversarial Instruction List Optimization (FAILOpt), an attack that searches for instructions deceptive to AIGT detectors exploiting prompt-specific shortcuts. FAILOpt effectively drops the detection performance of the target detector, to a degree comparable to other attacks based on adversarial in-context examples. We also utilize our method to enhance the robustness of the detector by mitigating the shortcuts. Based on the findings, we further train the classifier with the dataset augmented by FAILOpt prompts. The augmented classifier exhibits improvements across generation models, tasks, and attacks. Our code will be available at https://github.com/zxcvvxcz/FAILOpt.
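
The abstract does not spell out the search procedure, so the following is only a minimal sketch of the general idea of searching for an instruction list that lowers a detector's score. It is not the authors' FAILOpt algorithm. The greedy loop, the function names (`greedy_instruction_search`, `toy_generate`, `toy_detector`), the candidate instructions, and the toy "shortcut" cues are all assumptions made for illustration. A real setup would call an LLM and a trained detector in place of the stubs.

```python
from typing import Callable, List, Sequence


def greedy_instruction_search(
    candidates: Sequence[str],
    generate: Callable[[List[str]], str],
    detector_score: Callable[[str], float],
    max_instructions: int = 3,
) -> List[str]:
    """Greedily build an instruction list that lowers the detector's AI score.

    Hypothetical illustration only: each round adds the single candidate
    instruction that reduces the score the most, and stops when none helps.
    """
    chosen: List[str] = []
    best = detector_score(generate(chosen))
    for _ in range(max_instructions):
        best_candidate = None
        for cand in candidates:
            if cand in chosen:
                continue
            score = detector_score(generate(chosen + [cand]))
            if score < best:
                best, best_candidate = score, cand
        if best_candidate is None:
            break  # no remaining instruction lowers the score further
        chosen.append(best_candidate)
    return chosen


def toy_generate(instructions: List[str]) -> str:
    """Stand-in for an LLM: the output changes with the given instructions."""
    text = "Overall, the study is important. In conclusion, it matters."
    if "avoid summary phrases" in instructions:
        text = text.replace("In conclusion, ", "").replace("Overall, ", "")
    if "use contractions" in instructions:
        text = text.replace("it matters", "it's a big deal")
    return text


def toy_detector(text: str) -> float:
    """Stand-in for a detector that leans on two surface cues (a toy shortcut)."""
    cues = ["Overall,", "In conclusion,"]
    return sum(cue in text for cue in cues) / len(cues)


CANDIDATES = ["use contractions", "avoid summary phrases", "write formally"]

if __name__ == "__main__":
    found = greedy_instruction_search(CANDIDATES, toy_generate, toy_detector)
    print(found, toy_detector(toy_generate(found)))
```

In this toy setting the detector keys only on phrases that the default prompt happens to elicit, so a single instruction that removes those phrases is enough to evade it. That mirrors, in miniature, why a detector trained on data from one prompt may not generalize to text produced under other prompts.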

