IR AI CL DL

Feb 2, 2024

Detection of tortured phrases in scientific literature

Eléna Martel, Martin Lentschat, Cyril Labbé

arXiv:2402.03370v137.0124 citations h-index: 4 WIESP

Originality: Incremental advance

AI Analysis

The paper addresses the problem of identifying manipulated text, which matters to both researchers and publishers, but the advance is incremental because it builds on existing language models and detection strategies.

The paper tackled the problem of detecting tortured phrases in scientific literature, which result from paraphrasing tools used to evade plagiarism detection, and found that a token prediction method achieved a recall of 0.87 and precision of 0.61 for retrieving new phrases.
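To put the two reported figures on a single scale, they can be combined into an F1 score. The summary reports only recall and precision, so the F1 value below is derived from them and is not a number quoted by the authors; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (returns 0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the best-performing approach.
precision, recall = 0.61, 0.87

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.72"
```

Read this way, the method recovers most tortured phrases (recall 0.87), while roughly six in ten flagged candidates are genuine (precision 0.61), which is consistent with sending candidates to domain experts for validation.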

This paper presents several automatic detection methods for extracting so-called tortured phrases from scientific papers. These tortured phrases, e.g. "flag to clamor" instead of "signal to noise", result from paraphrasing tools used to escape plagiarism detection. We built a dataset and evaluated several strategies for flagging previously undocumented tortured phrases. The proposed and tested methods are based on language models and rely either on embedding similarities or on masked-token prediction. We found that an approach that uses token prediction and propagates the scores to the chunk level gives the best results. With a recall of 0.87 and a precision of 0.61, it can retrieve new tortured phrases to be submitted to domain experts for validation.
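The idea of scoring tokens and then propagating the scores to chunks can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not the authors' implementation: the suspicion score (one minus the probability a model gives the original token when it is masked), the mean as the aggregation rule, the window size of three tokens, and the toy probability table `toy_masked_prob`, which stands in for a real masked language model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: probability the model assigns to the original token
# at position i when that position is masked. A real system would query a
# masked language model here.
MaskedProb = Callable[[Sequence[str], int], float]


def token_scores(tokens: Sequence[str], masked_prob: MaskedProb) -> List[float]:
    """Suspicion per token: 1 - P(original token | masked context)."""
    return [1.0 - masked_prob(tokens, i) for i in range(len(tokens))]


def chunk_scores(
    tokens: Sequence[str], scores: Sequence[float], size: int = 3
) -> List[Tuple[str, float]]:
    """Propagate token scores to sliding chunks by averaging (an assumed rule)."""
    chunks = []
    for start in range(len(tokens) - size + 1):
        window = scores[start:start + size]
        chunks.append((" ".join(tokens[start:start + size]), sum(window) / size))
    return chunks


def toy_masked_prob(tokens: Sequence[str], i: int) -> float:
    """Invented probabilities, for illustration only."""
    unlikely = {"flag": 0.05, "clamor": 0.02}
    return unlikely.get(tokens[i], 0.9)


tokens = "the flag to clamor ratio is low".split()
scores = token_scores(tokens, toy_masked_prob)

# Rank chunks from most to least suspicious.
ranked = sorted(chunk_scores(tokens, scores), key=lambda c: c[1], reverse=True)
top_chunk = ranked[0][0]
print(top_chunk)  # prints "flag to clamor"
```

Scoring at the chunk level lets a multi-word phrase stand out even when one of its tokens (here "to") is unremarkable on its own.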
