CLIR | Oct 8, 2025

Overview of the Plagiarism Detection Task at PAN 2025

arXiv:2510.06805v1 | h-index: 15 | CLEF
Originality: Synthesis-oriented
AI Analysis

This work addresses plagiarism detection for scientific articles, but it is incremental as it builds on existing tasks and datasets.

The paper addresses the detection of automatically generated textual plagiarism in scientific articles. It creates a novel large-scale dataset using three large language models and evaluates participant approaches, finding that naive semantic similarity methods reach up to 0.8 recall and 0.5 precision but generalize poorly to the older PAN 2015 dataset.

The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning it with its respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to interpret the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches, as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack of generalizability.
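
To make the phrase "naive semantic similarity approaches based on embedding vectors" concrete, the sketch below pairs suspicious passages with source passages by thresholding cosine similarity, then scores the predicted pairs with precision and recall. It is an illustration under stated assumptions, not the method of any participant or baseline: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the threshold of 0.5 is arbitrary, and the pair-level precision and recall computed here are a simplification that should not be read as the task's official evaluation measures.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect(suspicious, sources, threshold=0.5):
    """Return the set of (suspicious_index, source_index) pairs whose
    similarity is at least `threshold` (an assumed, untuned value)."""
    source_vecs = [embed(s) for s in sources]
    pairs = set()
    for i, passage in enumerate(suspicious):
        vec = embed(passage)
        for j, source_vec in enumerate(source_vecs):
            if cosine(vec, source_vec) >= threshold:
                pairs.add((i, j))
    return pairs


def precision_recall(predicted, gold):
    """Pair-level precision and recall of predicted alignments against gold ones."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

A detector of this kind only compares passages by their vector similarity, which may help explain why such approaches can score well on one dataset and still transfer poorly to another: the threshold and the embedding are tuned to the obfuscation style seen in one collection.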

