Noisy Pair Corrector for Dense Retrieval
This work addresses a practical issue in dense retrieval: in real-world applications, automatically collected training data introduces noise. It offers a way to make models more robust to that noise.
The paper tackles the problem of training dense retrieval models on noisy query-document pairs. It proposes a Noisy Pair Corrector (NPC) that detects and corrects mismatched pairs, and it reports excellent performance on benchmarks such as Natural Question and TriviaQA.
Most dense retrieval models rest on an implicit assumption: the training query-document pairs are exactly matched. Because annotating a corpus manually is expensive, training pairs in real-world applications are usually collected automatically, which inevitably introduces mismatched-pair noise. In this paper, we explore an interesting and challenging problem in dense retrieval: how to train an effective model in the presence of mismatched-pair noise. To solve this problem, we propose a novel approach called Noisy Pair Corrector (NPC), which consists of a detection module and a correction module. The detection module identifies noisy pairs by calculating the perplexity between the annotated positive document and easy negative documents. The correction module uses an exponential moving average (EMA) model to provide a soft supervision signal, which helps mitigate the effects of noise. We conduct experiments on the text-retrieval benchmarks Natural Question and TriviaQA and on the code-search benchmarks StaQC and SO-DS. Experimental results show that NPC achieves excellent performance in handling both synthetic and realistic noise.
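
To make the two modules concrete, the following is a minimal sketch under stated assumptions, not the implementation used in the paper. The abstract gives no formulas, so the sketch assumes that perplexity is the exponential of the cross-entropy of the annotated positive against the easy negatives, that a pair is flagged as noisy when its perplexity exceeds a fixed threshold, and that the soft signal is the EMA model's softmax distribution over candidate documents. All function names, the threshold rule, and the decay value are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pair_perplexity(pos_score, neg_scores):
    """Perplexity of the annotated positive against easy negatives.

    Assumption: perplexity = exp(cross-entropy) = 1 / p(positive),
    where p is a softmax over the positive and negative scores.
    """
    p_pos = softmax([pos_score] + list(neg_scores))[0]
    return 1.0 / p_pos


def flag_noisy(perplexities, threshold):
    """Hypothetical detection rule: high perplexity marks a suspected mismatch."""
    return [ppl > threshold for ppl in perplexities]


def ema_update(teacher_params, student_params, decay=0.99):
    """Exponential moving average: move each teacher parameter toward the student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


def soft_targets(teacher_scores):
    """Soft supervision signal: the EMA model's distribution over candidates."""
    return softmax(teacher_scores)


# Usage with made-up similarity scores.
well_matched = pair_perplexity(5.0, [0.0, 0.0])  # positive clearly beats the negatives
mismatched = pair_perplexity(0.0, [0.0, 0.0])    # positive is indistinguishable from them
flags = flag_noisy([well_matched, mismatched], threshold=2.0)
teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
targets = soft_targets([2.0, 1.0, 0.0])
```

In this sketch a well-matched pair has perplexity close to 1, while a pair whose positive scores no better than the negatives has perplexity equal to the number of candidates, which is why a threshold can separate the two. A training loop would then replace the hard label of each flagged pair with `soft_targets` computed from the EMA model's scores.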