HC3 Plus: A Semantic-Invariant Human ChatGPT Comparison Corpus
This work addresses the problem of AI-generated text detection for researchers and practitioners, but its contribution is incremental because it builds on existing datasets and methods.
The paper tackles the challenge of detecting AI-generated content, particularly from ChatGPT, in semantic-invariant tasks like summarization and translation, by introducing a more extensive dataset and exploring instruction fine-tuning models for detection.
ChatGPT has garnered significant interest due to its impressive performance; however, there is growing concern about its potential risks, particularly around AI-generated content (AIGC), which untrained individuals often find difficult to identify. Current datasets for detecting ChatGPT-generated text focus primarily on question-answering tasks and often overlook tasks with semantic-invariant properties, such as summarization, translation, and paraphrasing. In this paper, we demonstrate that detecting model-generated text in semantic-invariant tasks is more challenging. To address this gap, we introduce a more extensive and comprehensive dataset that incorporates a wider range of tasks than previous work, including those with semantic-invariant properties. In addition, because instruction fine-tuning has demonstrated superior performance across various tasks, we explore the use of instruction fine-tuned models for detecting text generated by ChatGPT.
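To make the instruction fine-tuning idea concrete, the following is a minimal sketch of how a detection instance might be cast as an instruction-style text-to-text pair. The instruction wording, the label strings `human` and `model`, and the helper names `build_detection_example` and `parse_prediction` are all assumptions for illustration; the paper's actual prompt and label format are not given in this summary.

```python
from dataclasses import dataclass

# Hypothetical instruction template; the paper's actual prompt is not given here.
INSTRUCTION = (
    "Determine whether the following text was written by a human "
    "or generated by a model."
)

# Hypothetical label vocabulary used as the text-to-text target.
LABELS = ("human", "model")


@dataclass
class DetectionExample:
    prompt: str
    target: str


def build_detection_example(text: str, label: str) -> DetectionExample:
    """Wrap a passage and its label as an instruction-style training pair."""
    if label not in LABELS:
        raise ValueError(f"label must be one of {LABELS}, got {label!r}")
    prompt = f"{INSTRUCTION}\n\nText: {text.strip()}\n\nAnswer:"
    return DetectionExample(prompt=prompt, target=label)


def parse_prediction(generated: str) -> str:
    """Map a model's free-form output back to a label, or 'unknown'."""
    answer = generated.strip().lower()
    for label in LABELS:
        if answer.startswith(label):
            return label
    return "unknown"
```

Under these assumptions, a summarization or translation output would be passed as `text`, and the fine-tuned model would be trained to emit the target string after `Answer:`. Treating detection as generation rather than as a classification head is one possible design choice, not necessarily the one the authors made.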