CL AI Oct 13, 2023

Unsupervised Domain Adaption for Neural Information Retrieval

Carlos Dominguez, Jon Ander Campos, Eneko Agirre, Gorka Azkune

arXiv:2310.09350v1 0.5 h-index: 14

Originality: Incremental advance

AI Analysis

This work addresses the data annotation bottleneck for neural IR practitioners, offering an incremental improvement in domain adaptation methods.

The paper tackles the cost of annotated data for neural information retrieval by comparing two synthetic annotation methods: query generation with Large Language Models and rule-based string manipulation. It finds that LLMs outperform rule-based methods by a large margin and that unsupervised domain adaptation is effective compared to zero-shot application.

Neural information retrieval requires costly annotated data for each target domain to be competitive. Synthetic annotation by query generation using Large Language Models or rule-based string manipulation has been proposed as an alternative, but their relative merits have not been analysed. In this paper, we compare both methods head-to-head using the same neural IR architecture. We focus on the BEIR benchmark, which includes test datasets from several domains with no training data, and explore two scenarios: zero-shot, where the supervised system is trained on a large out-of-domain dataset (MS-MARCO); and unsupervised domain adaptation, where, in addition to MS-MARCO, the system is fine-tuned on synthetic data from the target domain. Our results indicate that Large Language Models outperform rule-based methods in all scenarios by a large margin, and, more importantly, that unsupervised domain adaptation is effective compared to applying a supervised IR system in a zero-shot fashion. In addition, we explore several sizes of open Large Language Models to generate synthetic data and find that a medium-sized model suffices. Code and models are publicly available for reproducibility.
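
To make the synthetic-annotation idea concrete, here is a minimal Python sketch of how target-domain passages might be paired with generated queries to form fine-tuning data. It is an illustration only, not the authors' code: the function names `rule_based_query` and `build_synthetic_pairs` are hypothetical, and the rule used (taking a random contiguous word span of the passage as a pseudo-query) is an assumed stand-in for "rule-based string manipulation", which the abstract does not specify. An LLM-based generator could be passed in place of the rule-based one.

```python
import random


def rule_based_query(passage, min_len=3, max_len=8, rng=None):
    """Hypothetical rule: return a random contiguous word span of the passage.

    This is an assumed stand-in for rule-based string manipulation,
    not the specific rule evaluated in the paper.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    words = passage.split()
    if not words:
        return ""
    # Never ask for more words than the passage contains.
    span = min(len(words), rng.randint(min_len, max_len))
    start = rng.randint(0, len(words) - span)
    return " ".join(words[start:start + span])


def build_synthetic_pairs(passages, generate_query):
    """Pair each target-domain passage with one generated query.

    `generate_query` is any callable mapping a passage to a query string,
    for example the rule above or a wrapper around an LLM.
    """
    pairs = []
    for passage in passages:
        query = generate_query(passage)
        if query:  # skip passages that yield no usable query
            pairs.append((query, passage))
    return pairs
```

In the unsupervised domain adaptation scenario described above, pairs like these would serve as positive (query, passage) examples for further fine-tuning a retriever already trained on MS-MARCO; in the zero-shot scenario this step is skipped.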
