Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes
This work addresses recipe plagiarism and redundancy, a problem for both users and platforms, but the contribution is incremental because it adapts an existing method to a specific domain.
The study tackles the detection of near-duplicate Japanese-language recipes on user-generated websites by extending Word Mover's Distance to character 3-gram embeddings. The proposed method successfully detected near-duplicate recipes that a comparison method missed.
On user-generated recipe websites, users post their original recipes. Some recipes, however, closely resemble other recipes in major components such as the cooking instructions. We refer to such recipes as "near-duplicate recipes". In this study, we propose a method that extends the "Word Mover's Distance", which calculates distances between texts based on word embeddings, to character 3-gram embeddings. Using a corpus of over 1.21 million recipes, we learned word embeddings and character 3-gram embeddings using a Skip-Gram model with negative sampling and fastText, in order to extract candidate pairs of near-duplicate recipes. We then annotated these candidates and evaluated the proposed method against a comparison method. Our results demonstrate that the proposed method successfully detected near-duplicate recipes that the comparison method did not detect.
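To make the idea of a mover's distance over character 3-grams concrete, the following is a minimal sketch and not the authors' implementation. It makes two loudly stated assumptions. First, `toy_embed` is a hypothetical stand-in for the learned Skip-Gram or fastText vectors: it encodes each 3-gram as a sparse one-hot vector per (position, character), so two 3-grams are closer when they share characters in the same position. Second, instead of solving the full optimal-transport problem that Word Mover's Distance defines, it computes the simpler relaxed variant in which each 3-gram sends all of its mass to its nearest counterpart in the other text. This is a lower bound on the full distance. All function names and the sample strings are illustrative.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def toy_embed(gram):
    """Hypothetical stand-in for a learned embedding (assumption, not from the paper).

    Each (position, character) pair is one dimension of a sparse one-hot vector.
    """
    return {(pos, ch): 1.0 for pos, ch in enumerate(gram)}


def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def one_sided_cost(src, dst, embed):
    """Move each source n-gram's normalized mass to its nearest target n-gram."""
    weights = Counter(src)
    total = sum(weights.values())
    dst_vecs = [embed(g) for g in set(dst)]
    cost = 0.0
    for gram, count in weights.items():
        vec = embed(gram)
        cost += (count / total) * min(euclidean(vec, d) for d in dst_vecs)
    return cost


def relaxed_mover_distance(text_a, text_b, n=3, embed=toy_embed):
    """Relaxed mover's distance: the larger of the two one-sided costs."""
    grams_a, grams_b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    if not grams_a or not grams_b:
        raise ValueError("both texts need at least n characters")
    return max(one_sided_cost(grams_a, grams_b, embed),
               one_sided_cost(grams_b, grams_a, embed))


# Illustrative cooking instructions (made up for this sketch).
base = "玉ねぎを炒める"        # "fry the onion"
near = "玉ねぎをよく炒める"    # "fry the onion well"
other = "砂糖を加えて混ぜる"   # "add sugar and mix"

print(relaxed_mover_distance(base, near))
print(relaxed_mover_distance(base, other))
```

With these toy vectors the first distance is smaller than the second, because `base` and `near` share several 3-grams exactly while `base` and `other` share none. In practice, `embed` would look up the learned character 3-gram vectors, and a pair would be flagged as a near-duplicate candidate when its distance falls below a threshold chosen on annotated data.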