Discourse Features Enhance Detection of Document-Level Machine-Generated Content
This work targets academic plagiarism and the spread of misinformation, two problems faced by users who rely on MGC detection. The contribution is incremental: it builds on existing detection methods by adding discourse features.
The paper tackles the detection of machine-generated content (MGC) at the document level. The task is challenging because existing detectors rely on surface-level features and are therefore susceptible to paraphrasing. The authors introduce DTransformer, a model that integrates discourse analysis, and report substantial performance gains, such as a 15.5% absolute improvement on the paraLFQA dataset compared to state-of-the-art approaches.
The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and for texts that have subsequently been paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To better capture the structure of longer texts at the document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It yields substantial performance gains across the evaluated datasets: a 15.5% absolute improvement on paraLFQA, a 4% absolute improvement on paraWP, and a 1.5% absolute improvement on M4 compared to SOTA approaches. The data and code are available at: https://github.com/myxp-lyp/Discourse-Features-Enhance-Detection-of-Document-Level-Machine-Generated-Content.git.
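To give a rough sense of what encoding structural features through PDTB-style preprocessing could look like, the following is a minimal sketch and not the DTransformer pipeline itself. It assumes a tiny hand-written connective lexicon (`CONNECTIVE_SENSES`, hypothetical) that maps explicit connectives to the four PDTB top-level sense classes, and a naive sentence splitter; the function names are likewise illustrative. A real system would presumably rely on a trained discourse parser.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: explicit connectives mapped to the four PDTB
# top-level sense classes. Far from complete; for illustration only.
CONNECTIVE_SENSES = {
    "because": "Contingency",
    "therefore": "Contingency",
    "however": "Comparison",
    "but": "Comparison",
    "then": "Temporal",
    "afterwards": "Temporal",
    "also": "Expansion",
    "moreover": "Expansion",
}


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def tag_sentence(sentence):
    """Return the sense of the first known connective, or 'NoRel' if none."""
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in CONNECTIVE_SENSES:
            return CONNECTIVE_SENSES[token]
    return "NoRel"


def discourse_features(document):
    """Produce a per-sentence sense sequence plus document-level sense counts."""
    sentences = split_sentences(document)
    tags = [tag_sentence(s) for s in sentences]
    return {"sentences": sentences, "tags": tags, "counts": dict(Counter(tags))}


if __name__ == "__main__":
    doc = "It rained all day. However, we went out. Then we came home because it was late."
    features = discourse_features(doc)
    for sentence, tag in zip(features["sentences"], features["tags"]):
        print(f"[{tag}] {sentence}")
```

The per-sentence tag sequence is the kind of structural signal that survives surface-level paraphrasing better than word choice does, which is the intuition behind feeding discourse information to a document-level detector alongside the raw text.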