CL | AI | May 30, 2025

Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

arXiv:2505.24523v1 | 10 citations | h-index: 30 | Has Code | ACL
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of robust machine-generated text detection, which matters for preventing misinformation. Its contribution is incremental: it stress-tests existing detection methods rather than proposing new ones.

The authors test state-of-the-art detectors against adversarial attacks that shift the writing style of language models toward human-written text. Under these attacks, detection performance drops significantly.

Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and the features that detectors rely on to identify MGT. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
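
The DPO step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each preference pair treats the human-written text as the preferred ("chosen") completion and the machine-generated text as the dispreferred ("rejected") one, which is one natural reading of "shift the MGT style toward HWT". The function names, the dictionary keys, and the value of `beta` are illustrative choices, and the log-probabilities are made-up numbers standing in for sequence log-likelihoods from a policy model and a frozen reference model.

```python
import math


def make_preference_pair(prompt: str, human_text: str, machine_text: str) -> dict:
    """Build one preference example (hypothetical helper).

    Assumption: the human-written text is 'chosen' and the
    machine-generated text is 'rejected'.
    """
    return {"prompt": prompt, "chosen": human_text, "rejected": machine_text}


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full completion
    under the policy or the frozen reference model.
    """
    # Implicit rewards: how much more likely the policy makes each
    # completion relative to the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # The loss shrinks as the policy favors the chosen (human-like) text.
    return -log_sigmoid(margin)


pair = make_preference_pair(
    prompt="Write a short news paragraph about local elections.",
    human_text="(a human-written paragraph)",
    machine_text="(a model-generated paragraph)",
)

# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has moved toward the human-written text: the loss is lower.
shifted = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Minimizing this loss over many such pairs pushes the fine-tuned model to assign relatively more probability to human-style completions than the reference model does, which is the mechanism the abstract credits for making new generations harder to detect.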

Code Implementations: 1 repo
