IR, Dec 11, 2019

Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes

arXiv:1912.05171v2
Originality: Synthesis-oriented
AI Analysis

The paper addresses recipe plagiarism and redundancy, a problem for both users and platforms, but its contribution is incremental: it adapts an existing method to a specific domain.

The study tackled the problem of detecting near-duplicate Japanese-language recipes on user-generated websites. It proposed a method that extends Word Mover's Distance to character 3-gram embeddings, and the method detected near-duplicate recipes that a comparison method missed.

On user-generated recipe websites, users post their original recipes. Some recipes, however, closely resemble other recipes in major components such as the cooking instructions. We refer to such recipes as "near-duplicate recipes". In this study, we propose a method that extends the "Word Mover's Distance", which calculates distances between texts based on word embeddings, to character 3-gram embeddings. Using a corpus of over 1.21 million recipes, we trained word embeddings and character 3-gram embeddings using a Skip-Gram model with negative sampling and fastText, and used them to extract candidate pairs of near-duplicate recipes. We then annotated these candidates and evaluated the proposed method against a comparison method. Our results demonstrated that the proposed method detected near-duplicate recipes that the comparison method did not.
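
To make the idea of a mover's distance over character 3-grams concrete, here is a minimal sketch in Python. It is not the authors' implementation. Two things are assumptions made for illustration: `toy_embed` is a hypothetical stand-in for a trained character 3-gram embedding lookup, and the distance computed is the relaxed lower bound on the mover's distance (each 3-gram sends all of its mass to its nearest counterpart) rather than the full optimal-transport solution.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def nbow(grams):
    """Normalized bag of n-grams: each distinct gram's share of the total mass."""
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def toy_embed(gram, dim=8):
    """Hypothetical stand-in for a trained embedding lookup (assumption).

    Maps each character to a bucket based on its code point and position.
    Different grams can collide; a real system would use learned vectors.
    """
    vec = [0.0] * dim
    for pos, ch in enumerate(gram):
        vec[(ord(ch) + pos) % dim] += 1.0
    return vec


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_movers_distance(text_a, text_b, embed=toy_embed, n=3):
    """Relaxed lower bound on the mover's distance between two texts.

    Each n-gram moves all of its mass to the nearest n-gram of the other
    text; the larger of the two one-sided costs is returned.
    """
    wa = nbow(char_ngrams(text_a, n))
    wb = nbow(char_ngrams(text_b, n))
    if not wa or not wb:
        raise ValueError("both texts must contain at least one n-gram")

    def one_sided(src, dst):
        return sum(
            w * min(euclidean(embed(g), embed(h)) for h in dst)
            for g, w in src.items()
        )

    return max(one_sided(wa, wb), one_sided(wb, wa))


if __name__ == "__main__":
    # Two instruction fragments that differ only in the verb ending.
    print(relaxed_movers_distance("玉ねぎを炒める", "玉ねぎを炒めます"))
```

Working on character 3-grams means no word segmentation is needed, and two texts that share most of their 3-grams get a small distance even when individual words are inflected or spelled differently. Swapping `toy_embed` for a lookup into trained vectors leaves the rest of the sketch unchanged.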
