CLIR | Nov 8, 2022

Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps

arXiv:2211.03988v1298 citations | h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses domain shift in information retrieval when target-domain supervision is scarce. It is an incremental advance that builds on existing SPLADE and domain adaptation methods.

The paper tackles unsupervised domain adaptation for sparse retrieval models such as SPLADE, which degrade when the target domain differs from the source domain in vocabulary and word frequency. It reports that the proposed method outperforms the state-of-the-art domain adaptation method and achieves state-of-the-art results when combined with BM25.

IR models using a pretrained language model significantly outperform lexical approaches like BM25. In particular, SPLADE, which encodes texts to sparse vectors, is an effective model for practical use because it shows robustness to out-of-domain datasets. However, SPLADE still struggles with exact matching of low-frequency words in training data. In addition, domain shifts in vocabulary and word frequencies deteriorate the IR performance of SPLADE. Because supervision data are scarce in the target domain, addressing the domain shifts without supervision data is necessary. This paper proposes an unsupervised domain adaptation method by filling vocabulary and word-frequency gaps. First, we expand a vocabulary and execute continual pretraining with a masked language model on a corpus of the target domain. Then, we multiply SPLADE-encoded sparse vectors by inverse document frequency weights to consider the importance of documents with low-frequency words. We conducted experiments using our method on datasets with a large vocabulary gap from a source domain. We show that our method outperforms the present state-of-the-art domain adaptation method. In addition, our method achieves state-of-the-art results, combined with BM25.
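
The IDF-weighting step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The smoothed IDF formula, the choice to weight both query and document vectors, the toy corpus, and the hand-written "SPLADE-like" term weights are all assumptions made for illustration. The helper names (`idf_weights`, `apply_idf`, `dot`) are hypothetical.

```python
import math
from collections import Counter


def idf_weights(corpus_tokens):
    """Smoothed IDF per term from tokenized target-domain documents.

    Assumed formula: ln((N + 1) / (df + 1)) + 1. The paper may use a
    different variant.
    """
    n_docs = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((n_docs + 1) / (d + 1)) + 1.0 for t, d in df.items()}


def apply_idf(sparse_vec, idf, default=1.0):
    """Multiply each term weight of a sparse vector by that term's IDF."""
    return {t: w * idf.get(t, default) for t, w in sparse_vec.items()}


def dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


# Toy target-domain corpus: "treatment" is common, "remdesivir" is rare.
corpus = [
    ["aspirin", "treatment"],
    ["aspirin", "treatment", "dose"],
    ["remdesivir", "treatment"],
    ["treatment", "dose"],
]
idf = idf_weights(corpus)

# Hypothetical sparse encodings standing in for SPLADE output. The rare
# term receives a small weight, mimicking an encoder that under-weights
# words that were infrequent in its training data.
query = {"remdesivir": 1.0, "treatment": 1.0}
doc_a = {"treatment": 1.5, "aspirin": 1.0}
doc_b = {"treatment": 1.0, "remdesivir": 0.4}

raw_a, raw_b = dot(query, doc_a), dot(query, doc_b)

# Assumption: IDF weights are applied to both the query and the documents.
q_w = apply_idf(query, idf)
idf_a = dot(q_w, apply_idf(doc_a, idf))
idf_b = dot(q_w, apply_idf(doc_b, idf))

print("raw scores:         ", raw_a, raw_b)
print("IDF-weighted scores:", idf_a, idf_b)
```

In this toy setup the document that actually mentions the rare query term ranks lower before weighting and higher after it, which is the effect the abstract attributes to IDF weighting. Real SPLADE vectors are indexed by subword vocabulary IDs rather than whole words, so a practical version would compute document frequencies over the expanded tokenizer vocabulary.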

Code Implementations: 1 repo
