Detecting Media Clones in Cultural Repositories Using a Positive Unlabeled Learning Approach
This work addresses duplicate discovery in cultural heritage archives, giving curators a method that proposes candidate duplicates for verification without requiring explicit negative examples.
The paper formulates the detection of duplicate media items in cultural repositories as a Positive-Unlabeled (PU) learning problem. On the AtticPOT dataset it achieves an F1 score of 90.79 (AUROC=98.99), improving F1 by +7.70 points over the best baseline.
We formulate curator-in-the-loop duplicate discovery in the AtticPOT repository as a Positive-Unlabeled (PU) learning problem. Given a single anchor per artefact, we train a lightweight per-query Clone Encoder on augmented views of the anchor and score the unlabeled repository with an interpretable threshold on the latent l_2 norm. The system proposes candidates for curator verification, uncovering cross-record duplicates that were not verified a priori. On CIFAR-10 we obtain F1=96.37 (AUROC=97.97); on AtticPOT we reach F1=90.79 (AUROC=98.99), improving F1 by +7.70 points over the best baseline (SVDD) under the same lightweight backbone. Qualitative "find-similar" panels show stable neighbourhoods across viewpoint and condition. The method avoids explicit negatives, offers a transparent operating point, and fits de-duplication, record linkage, and curator-in-the-loop workflows.
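The scoring step described above, a threshold on the latent l_2 norm, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the latent vectors have already been produced by some per-query encoder, that a smaller norm indicates a likelier clone, and that the threshold is taken as a quantile of the norms of the anchor's augmented views. The abstract specifies none of these choices, and the function names `l2_norm`, `pick_threshold`, and `propose_candidates` are hypothetical.

```python
import math


def l2_norm(z):
    """Euclidean (l_2) norm of one latent vector given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in z))


def pick_threshold(anchor_view_latents, quantile=0.95):
    """Nearest-rank quantile of the norms of the anchor's augmented views.

    Assumption: the operating point is set from the positives alone,
    since PU learning provides no labeled negatives.
    """
    norms = sorted(l2_norm(z) for z in anchor_view_latents)
    rank = max(1, math.ceil(quantile * len(norms)))
    return norms[rank - 1]


def propose_candidates(repo_latents, tau):
    """Return (index, norm) pairs with norm <= tau, closest to the origin first.

    Assumption: a small latent norm means "likely clone". The result is a
    list of proposals for curator verification, not a final decision.
    """
    scored = [(i, l2_norm(z)) for i, z in enumerate(repo_latents)]
    kept = [(i, n) for i, n in scored if n <= tau]
    return sorted(kept, key=lambda pair: pair[1])


# Toy latents, hand-written for illustration only (not real data).
anchor_views = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.0], [0.0, 0.4]]
repository = [[3.0, 4.0], [0.1, 0.1], [0.0, 0.05], [1.0, 1.0]]

tau = pick_threshold(anchor_views)
for index, norm in propose_candidates(repository, tau):
    print(f"candidate record {index}: latent norm {norm:.3f} <= tau {tau:.3f}")
```

Because the decision is a single scalar comparison, a curator can inspect the threshold directly and tighten or loosen it to trade precision against recall.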