CL · Nov 28, 2023

RETSim: Resilient and Efficient Text Similarity

arXiv:2311.17264v1 · 4 citations · h-index: 35 · Has Code
Originality: Highly original
AI Analysis

This work addresses the need for efficient and resilient text similarity tools in tasks like dataset deduplication and adversarial text retrieval, with incremental improvements in robustness and accuracy.

The paper tackles near-duplicate text retrieval, clustering, and dataset deduplication by introducing RETSim, a lightweight multilingual deep learning model. The authors report that it is significantly more robust and accurate than MinHash and neural text embeddings, and that it achieves new state-of-the-art performance on these tasks.

This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License at https://github.com/google/unisim.
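For context on the baseline named in the abstract, the following is a minimal sketch of MinHash-style near-duplicate scoring over character n-grams. It is not RETSim and does not use the unisim library. The function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the 3-gram size, and the 128 hash seeds are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _seeded_hash(token, seed):
    """Hash a token with a seed to simulate one of many hash functions."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=128, n=3):
    """Keep the minimum hash of the shingle set under each seeded hash."""
    grams = shingles(text, n)
    return [min(_seeded_hash(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog"
    perturbed = "The quick brown f0x jumps over the lazy d0g"  # character-level edits
    unrelated = "Completely different sentence about database indexing"

    sig_original = minhash_signature(original)
    print("perturbed:", estimated_jaccard(sig_original, minhash_signature(perturbed)))
    print("unrelated:", estimated_jaccard(sig_original, minhash_signature(unrelated)))
```

Because each substituted character changes every n-gram that overlaps it, a handful of typos or adversarial edits can lower the estimated similarity noticeably. This is the kind of sensitivity the abstract contrasts with learned metric embeddings.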

Code Implementations: 3 repos