PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?
This work addresses the challenge of identifying AI-generated content for applications in security and content moderation, but it is incremental as it builds on existing methods.
The paper tackles the problem of detecting machine-generated texts in English by combining RoBERTa-base embeddings with diversity features and a resampled training set, placing 12th out of 124 in Subtask A (monolingual track) of SemEval-2024 Task 8 with an accuracy of 0.91.
In this paper, we present our submission to the SemEval-2024 Task 8 "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection", focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach combines embeddings from RoBERTa-base with diversity features and uses a resampled training set. We rank 12th out of 124 in Subtask A (monolingual track), and our results show that our approach generalizes to unseen models and domains, achieving an accuracy of 0.91.
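As a rough illustration of what combining an embedding with diversity features could look like, the sketch below appends two simple lexical diversity measures to a precomputed text embedding. The specific measures (type-token ratio and hapax ratio), the regex tokenizer, and the function names `diversity_features` and `combine` are assumptions made for illustration; they are not taken from the submission, and the embedding is assumed to be computed elsewhere (for example, a 768-dimensional RoBERTa-base vector).

```python
import re
from typing import List, Sequence


def diversity_features(text: str) -> List[float]:
    """Return [type-token ratio, hapax ratio] for a text.

    Illustrative stand-ins for "diversity features"; the actual feature
    set used in the submission may differ.
    """
    # Simple lowercase word tokenizer (an assumption, not the paper's).
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return [0.0, 0.0]

    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1

    # Type-token ratio: distinct words divided by total words.
    ttr = len(counts) / len(tokens)
    # Hapax ratio: words occurring exactly once divided by total words.
    hapax = sum(1 for c in counts.values() if c == 1) / len(tokens)
    return [ttr, hapax]


def combine(embedding: Sequence[float], text: str) -> List[float]:
    """Concatenate a precomputed embedding with the diversity features."""
    return list(embedding) + diversity_features(text)


if __name__ == "__main__":
    # Placeholder vector standing in for a RoBERTa-base embedding.
    fake_embedding = [0.0] * 768
    features = combine(fake_embedding, "the cat saw the dog")
    print(len(features))
```

The combined vector would then be passed to a downstream classifier; concatenation is only one plausible way to fuse the two feature types.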