CL AI Jan 22, 2024

Fine-tuning Large Language Models for Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection

arXiv:2401.12326v1, 8 citations, h-index: 5
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of identifying AI-generated content for applications in security and content moderation, but it is incremental as it applies existing fine-tuning techniques to a new benchmark task.

The paper addresses the detection of machine-generated text from diverse LLMs across languages and domains in SemEval-2024 Task 8, focusing on the binary (Subtask A) and multi-class (Subtask B) classification subtasks. It finds that fine-tuned transformer models, particularly LoRA-RoBERTa, outperform traditional ML methods, and that majority voting improves performance in multilingual settings.

SemEval-2024 Task 8 introduces the challenge of identifying machine-generated texts from diverse Large Language Models (LLMs) in various languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtasks A and B. Each subtask is supported by three datasets for training, development, and testing. To tackle this task, the paper applies two methods: 1) traditional machine learning (ML) with natural language preprocessing (NLP) for feature extraction, and 2) fine-tuning LLMs for text classification. The results show that transformer models, particularly LoRA-RoBERTa, are more effective than traditional ML methods, and that majority voting is particularly effective in multilingual contexts for identifying machine-generated texts.
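
The abstract does not say how the majority vote is computed, so the following is only a minimal sketch of the general idea: several classifiers each label every text, and the most common label wins. The function name `majority_vote`, the label encoding (0 for human-written, 1 for machine-generated), and the tie-breaking rule (lowest label wins) are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by majority vote.

    predictions[m][i] is the label that model m assigns to text i.
    Ties are broken in favour of the smallest label (an assumption).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_texts = len(predictions[0])
    if any(len(p) != n_texts for p in predictions):
        raise ValueError("every model must label the same number of texts")

    voted = []
    # zip(*predictions) yields, for each text, the tuple of all model votes.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical outputs of three classifiers on three texts
# (0 = human-written, 1 = machine-generated; encoding assumed).
model_outputs = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
]
ensemble_labels = majority_vote(model_outputs)
```

With an odd number of binary classifiers, as in the usage lines above, ties cannot occur; the tie-breaking rule only matters for an even number of models or for multi-class labels such as those in Subtask B.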

