CL · DL · Nov 18, 2021

Detecting Cross-Language Plagiarism using Open Knowledge Graphs

arXiv:2111.09749v2 · 1 citation
Originality: Incremental advance
AI Analysis

This paper addresses cross-language plagiarism detection for researchers and educators. It offers a scalable approach that requires neither machine translation nor parallel corpora, though its contribution is an incremental improvement over existing methods.

The paper tackles cross-language plagiarism detection, especially for distant language pairs and sense-for-sense translations, by introducing CL-OSA. CL-OSA outperforms state-of-the-art methods at candidate retrieval and, for sense-for-sense translations, exceeds the best competitor's PlagDet score by more than a factor of two.
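For readers unfamiliar with the PlagDet score, the sketch below shows how it is commonly computed, assuming the standard definition from the PAN plagiarism detection evaluations: the F1 score of precision and recall, divided by log2(1 + granularity). This page does not state the formula, so treat the definition as an assumption. The function name `plagdet` and the input values are illustrative only and are not results from the paper.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Assumed definition: PlagDet = F1 / log2(1 + granularity).

    granularity >= 1 measures how often a single plagiarism case is
    reported as several fragments; 1.0 means one detection per case.
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Illustrative values only (not taken from the paper).
score_clean = plagdet(0.8, 0.8, 1.0)       # no fragmentation penalty
score_fragmented = plagdet(0.5, 0.5, 3.0)  # F1 is halved by the penalty
```

Under this definition, a detector that splits each case into several fragments is penalized even when its precision and recall are unchanged.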

Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to allow its application to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.
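The idea of comparing documents through language-independent entity vectors can be illustrated with a minimal sketch. It is not the authors' implementation. The lexicon `TOY_LEXICON`, the entity IDs `E1` to `E3`, and the functions `entity_vector` and `cosine` are hypothetical stand-ins. A real system would look entities up in Wikidata and disambiguate homonyms, which this sketch skips. Cosine similarity is also an assumption here, since the abstract does not name the similarity measure.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: surface forms in two languages mapped to shared,
# language-independent entity IDs. The IDs are placeholders, not real
# Wikidata identifiers, and no homonym disambiguation is attempted.
TOY_LEXICON = {
    "en": {"bank": "E1", "river": "E2", "money": "E3"},
    "de": {"bank": "E1", "fluss": "E2", "geld": "E3"},
}


def entity_vector(text: str, lang: str) -> Counter:
    """Count the entity IDs found in a whitespace-tokenized text."""
    lexicon = TOY_LEXICON[lang]
    tokens = text.lower().split()
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse entity-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# An English and a German sentence that mention the same entities.
english = entity_vector("the river bank", "en")
german = entity_vector("die bank am fluss", "de")
similarity = cosine(english, german)
```

Because both sentences map to the same entity IDs, they can be compared directly without translating either one, which is the property the abstract attributes to the entity-vector representation.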

Code Implementations: 1 repo