CL · Jul 21, 2023

OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples

arXiv:2307.11729v3 · 148 citations · h-index: 32
Originality Highly original
AI Analysis

This paper addresses the risk of LLM misuse by making detectors more robust against adversarial attacks, particularly in domains like student essays, though it is incremental in that it builds on existing detection methods.

The paper tackles the problem of detecting LLM-generated texts, which is challenging because attacks such as paraphrasing can evade detectors. It proposes OUTFOX, a framework that improves detector robustness through adversarial in-context learning. The detector improves F1-score by up to 41.3 points on attacked texts and reaches up to 96.9 points F1-score on non-attacked texts.

Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors lack robustness against attacks: their detection accuracy degrades when LLM-generated texts are simply paraphrased. Furthermore, a malicious user might deliberately attempt to evade the detectors based on detection results, a scenario that previous studies have not assumed. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves detection performance on attacker-generated texts by up to +41.3 points F1-score. Furthermore, the proposed detector shows state-of-the-art detection performance on non-attacked texts, reaching up to 96.9 points F1-score and beating existing detectors. Finally, the proposed attacker degrades the performance of detectors by up to -57.0 points F1-score, far outperforming the baseline paraphrasing method for evading detection.
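The interplay described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (attacker_prompt, detector_prompt, one_round), the "Human"/"LM" label strings, and the idea of passing the LLMs in as plain callables are all assumptions made for illustration. It only shows the data flow: the attacker sees essays paired with the detector's predicted labels, and the detector then sees the attacker's essay as an in-context example labeled as LLM-generated.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]   # hypothetical interface: prompt in, completion out
Labeled = Tuple[str, str]    # (essay, label); label is "Human" or "LM" (assumed strings)


def attacker_prompt(problem: str, examples: List[Labeled]) -> str:
    """Show the attacker essays together with the detector's predicted labels."""
    shots = "\n\n".join(
        f"Essay: {essay}\nDetector prediction: {label}" for essay, label in examples
    )
    return (
        f"{shots}\n\nProblem statement: {problem}\n"
        "Write an essay that the detector would label as Human.\nEssay:"
    )


def detector_prompt(examples: List[Labeled], essay: str) -> str:
    """Few-shot classification prompt; the examples may include adversarial essays."""
    shots = "\n\n".join(f"Essay: {e}\nAnswer: {label}" for e, label in examples)
    return f"{shots}\n\nEssay: {essay}\nAnswer:"


def one_round(
    problem: str,
    candidates: List[str],
    human_examples: List[Labeled],
    predict: Callable[[str], str],
    attacker_llm: LLM,
    detector_llm: LLM,
    target: str,
) -> Tuple[str, str]:
    """Run one attacker step followed by one detector step."""
    # Attacker side: pair each candidate essay with the detector's prediction
    # and use those pairs as in-context examples.
    labeled = [(essay, predict(essay)) for essay in candidates]
    adversarial_essay = attacker_llm(attacker_prompt(problem, labeled))

    # Detector side: the adversarial essay becomes an "LM" in-context example.
    shots = human_examples + [(adversarial_essay, "LM")]
    answer = detector_llm(detector_prompt(shots, target)).strip()
    return adversarial_essay, "LM" if answer.startswith("LM") else "Human"
```

In practice attacker_llm and detector_llm would wrap calls to a real model, and predict would be the detector's own labeling function; here any Python callables with matching signatures can be plugged in, which also makes the loop easy to exercise with stubs.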

Code Implementations: 1 repo
