CL, AI | Apr 8, 2024

PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?

Kseniia Petukhova, Roman Kazakov, Ekaterina Kochmar

arXiv:2404.05483v1 | 14.628 citations | h-index: 7 | SemEval

Originality: Synthesis-oriented

AI Analysis

This work addresses the identification of AI-generated content, with applications in security and content moderation. The contribution is incremental, since it builds on existing methods.

The paper tackles the detection of machine-generated text in English by combining RoBERTa-base embeddings with diversity features and training on a resampled dataset. The system placed 12th out of 124 in Subtask A (monolingual track) of SemEval-2024 Task 8 and reached an accuracy of 0.91.

In this paper, we present our submission to the SemEval-2024 Task 8 "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection", focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach combines RoBERTa-base embeddings with diversity features and uses a resampled training set. We rank 12th out of 124 in Subtask A (monolingual track), and our results show that our approach generalizes to unseen models and domains, achieving an accuracy of 0.91.
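The general idea of pairing an embedding with diversity features and balancing the training set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not list the diversity features or the resampling scheme, so the type-token ratio, the hapax ratio, and the downsampling strategy are assumptions chosen for simplicity. The function names are hypothetical, and the embedding is assumed to be precomputed elsewhere (for example, by a RoBERTa-base encoder).

```python
import random
from collections import Counter


def diversity_features(text):
    """Return two simple lexical-diversity measures for a text.

    Illustrative stand-ins only: the abstract does not specify which
    diversity features the system uses.
    """
    tokens = text.lower().split()
    if not tokens:
        return [0.0, 0.0]
    counts = Counter(tokens)
    # Share of distinct word types among all tokens.
    type_token_ratio = len(counts) / len(tokens)
    # Share of tokens whose word type occurs exactly once.
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / len(tokens)
    return [type_token_ratio, hapax_ratio]


def combine(embedding, text):
    """Concatenate a precomputed embedding with the diversity features."""
    return list(embedding) + diversity_features(text)


def resample_balanced(examples, seed=0):
    """Downsample every class to the size of the smallest class.

    `examples` is a list of (features, label) pairs. Downsampling is an
    assumed strategy; the paper may resample differently.
    """
    rng = random.Random(seed)
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append((features, label))
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced
```

The combined vectors produced by `combine` would then be passed to any binary classifier trained on the balanced set; the choice of classifier is left open here because the abstract does not name one.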
