IR CL Mar 1, 2023

UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers

Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Arafat Sultan, Christopher Potts

IBM

arXiv:2303.00807v338.6151 citations h-index: 58 Has Code

Originality: Incremental advance

AI Analysis

The paper addresses domain shift in retrieval for applications that lack labeled datasets. The advance is incremental because it builds on existing LLM and distillation techniques.

The paper tackles domain adaptation in information retrieval without labeled data. It uses LLMs to generate synthetic queries, fine-tunes a family of rerankers on those queries, and distills the rerankers into a single efficient retriever. The result is higher zero-shot accuracy in long-tail domains and lower latency than standard reranking.

Many information retrieval tasks require large labeled datasets for fine-tuning. However, such datasets are often unavailable, and their utility for real-world applications can diminish quickly due to domain shifts. To address this challenge, we develop and motivate a method for using large language models (LLMs) to generate large numbers of synthetic queries cheaply. The method begins by generating a small number of synthetic queries using an expensive LLM. After that, a much less expensive one is used to create large numbers of synthetic queries, which are used to fine-tune a family of reranker models. These rerankers are then distilled into a single efficient retriever for use in the target domain. We show that this technique boosts zero-shot accuracy in long-tail domains and achieves substantially lower latency than standard reranking methods.
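The four stages named in the abstract can be laid out as a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (`build_retriever`, `toy_llm`, and the callables passed in) is hypothetical, the models are stand-in stubs, and two design details are guesses the abstract does not confirm, namely that the seed queries are reused as in-prompt examples for the cheaper model and that each reranker is trained on a different slice of the synthetic data.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical placeholders, not the authors' code.
Pair = Tuple[str, str]                       # (synthetic query, passage)
QueryGen = Callable[[str, List[Pair]], str]  # (passage, in-prompt examples) -> query


def build_retriever(
    passages: List[str],
    expensive_llm: QueryGen,
    cheap_llm: QueryGen,
    train_reranker: Callable[[List[Pair]], object],
    distill: Callable[[List[object]], object],
    n_seed: int = 2,
    n_rerankers: int = 2,
) -> object:
    # Stage 1: a small number of synthetic queries from the expensive LLM.
    seeds = [(expensive_llm(p, []), p) for p in passages[:n_seed]]

    # Stage 2: many synthetic queries from the cheaper LLM.
    # Assumption: the seed pairs are reused as in-prompt examples.
    synthetic = [(cheap_llm(p, seeds), p) for p in passages]

    # Stage 3: fine-tune a family of rerankers.
    # Assumption: each reranker sees a different slice of the synthetic data.
    rerankers = [
        train_reranker(synthetic[i::n_rerankers]) for i in range(n_rerankers)
    ]

    # Stage 4: distill the rerankers into a single efficient retriever.
    return distill(rerankers)


def toy_llm(passage: str, examples: List[Pair]) -> str:
    # Stub generator: a real system would call a language model here.
    return "what is " + passage.split()[0] + "?"


passages = ["alpha text", "beta text", "gamma text", "delta text"]
retriever = build_retriever(
    passages,
    expensive_llm=toy_llm,
    cheap_llm=toy_llm,
    train_reranker=lambda pairs: len(pairs),  # stand-in "model": training-set size
    distill=lambda models: sum(models),       # stand-in "retriever": total pairs seen
)
```

With the stubs above, `retriever` is just a count of the training pairs that reached the rerankers. In a real pipeline the four callables would wrap actual model calls, training loops, and a distillation step.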
