RETSim: Resilient and Efficient Text Similarity
This work addresses the need for efficient, resilient text similarity tools in tasks such as dataset deduplication and adversarial text retrieval, improving on existing methods in both robustness and accuracy.
The paper tackles near-duplicate text retrieval, clustering, and dataset deduplication by introducing RETSim, a lightweight multilingual deep learning model. RETSim is significantly more robust and accurate than MinHash and neural text embeddings, and it achieves new state-of-the-art performance.
This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License at https://github.com/google/unisim.
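To make the idea of near-duplicate retrieval with metric embeddings concrete, here is a minimal sketch in Python. It does not use RETSim or the unisim package. The function `toy_embed` is a hypothetical stand-in (hashed character trigram counts, L2-normalized) that takes the place of a learned embedding, and `find_near_duplicates`, the dimension 256, and the threshold 0.8 are all assumptions chosen for illustration. The only part meant to carry over is the general pattern: embed each text as a unit vector, then treat pairs whose cosine similarity exceeds a threshold as near-duplicates.

```python
import hashlib
import math
from itertools import combinations


def toy_embed(text, dim=256, n=3):
    """Hypothetical stand-in for a learned embedding (NOT RETSim).

    Counts hashed character n-grams into `dim` buckets and L2-normalizes,
    so the dot product of two embeddings is their cosine similarity.
    """
    text = text.lower()
    vec = [0.0] * dim
    for i in range(max(len(text) - n + 1, 1)):
        gram = text[i:i + n]
        # md5 gives a hash that is stable across runs, unlike built-in hash().
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def find_near_duplicates(texts, threshold=0.8):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    embeddings = [toy_embed(t) for t in texts]
    return [
        (i, j)
        for i, j in combinations(range(len(texts)), 2)
        if cosine(embeddings[i], embeddings[j]) >= threshold
    ]


texts = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy d0g",  # character substitution
    "Stock markets closed higher on Friday amid strong earnings",
]
pairs = find_near_duplicates(texts, threshold=0.8)
print(pairs)
```

The all-pairs comparison is quadratic in the number of texts, which is fine for a small illustration. At dataset scale, an approximate nearest-neighbor index over the embeddings would normally replace the inner loop.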