Remining Hard Negatives for Generative Pseudo Labeled Domain Adaptation
This work addresses domain adaptation for neural information retrieval, offering incremental improvements to existing techniques for researchers and practitioners in retrieval systems.
The paper tackles the lack of robustness of dense retrievers to domain shift in zero-shot settings by proposing a method that refreshes hard negatives during knowledge distillation, which improved ranking performance on 13 out of 14 BEIR datasets and 9 out of 12 LoTTE datasets.
Dense retrievers have demonstrated significant potential for neural information retrieval; however, they lack robustness to domain shifts, which limits their efficacy in zero-shot settings across diverse domains. A state-of-the-art domain adaptation technique is Generative Pseudo Labeling (GPL). GPL uses synthetic query generation and initially mined hard negatives to distill knowledge from a cross-encoder into a dense retriever in the target domain. In this paper, we analyze the documents retrieved by the domain-adapted model and discover that they are more relevant to the target queries than those retrieved by the non-domain-adapted model. We then propose refreshing the hard-negative index during the knowledge distillation phase to mine better hard negatives. Our re-mining approach, R-GPL, boosts ranking performance on 13/14 BEIR datasets and 9/12 LoTTE datasets. Our contributions are (i) analyzing the hard negatives returned by domain-adapted and non-domain-adapted models and (ii) applying GPL training with and without hard-negative re-mining on the LoTTE and BEIR datasets.
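
The idea of refreshing hard negatives during distillation can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes dot-product scoring, a single positive document per synthetic query, and a fixed refresh interval. The names `mine_hard_negatives`, `train_with_remining`, `encode`, `update`, and `refresh_every` are hypothetical placeholders introduced here for illustration.

```python
import numpy as np


def mine_hard_negatives(query_embs, doc_embs, positive_ids, k):
    """For each query, return the ids of the k highest-scoring non-positive documents."""
    query_embs = np.asarray(query_embs, dtype=float)
    doc_embs = np.asarray(doc_embs, dtype=float)
    # Dot-product similarity between every query and every document: shape (Q, D).
    scores = query_embs @ doc_embs.T
    # Mask each query's positive document so it can never be picked as a negative.
    scores[np.arange(len(positive_ids)), positive_ids] = -np.inf
    # Sort by descending score and keep the top k document ids per query.
    return np.argsort(-scores, axis=1)[:, :k]


def train_with_remining(encode, update, queries, docs, positive_ids,
                        steps, refresh_every, k):
    """Hypothetical training loop that re-mines negatives every `refresh_every` steps.

    `encode(items)` is assumed to embed items with the *current* student model.
    `update(queries, docs, positive_ids, negatives)` is assumed to perform one
    distillation step (e.g. using cross-encoder scores as soft labels).
    """
    negatives = None
    refreshes = 0
    for step in range(steps):
        if step % refresh_every == 0:
            # Refresh: the partially adapted model retrieves new hard negatives.
            negatives = mine_hard_negatives(
                encode(queries), encode(docs), positive_ids, k
            )
            refreshes += 1
        update(queries, docs, positive_ids, negatives)
    return negatives, refreshes
```

With `refresh_every` set to at least `steps`, the negatives are mined once at the start and never refreshed, which corresponds to the static setting that the abstract contrasts with re-mining.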