AISep 24, 2024

Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval

Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, William M. Campbell

arXiv:2409.16497v120.923 citationsh-index: 35

Originality Incremental advance

AI Analysis

This addresses the data scarcity issue in information retrieval for low-resource settings, though it is incremental as it builds on existing LLM and retrieval frameworks.

The paper tackles the problem of costly supervised data for dense retrieval by introducing an unsupervised text representation learning technique via instruction-tuning LLMs, achieving significant zero-shot retrieval performance improvements, such as increasing FLAN-T5 models by 3.34-3.50% and outperforming competitive retrievers by up to 9.52% on NDCG@10 with smaller models.

Dense retrieval systems are commonly used for information retrieval (IR). They rely on learning text representations through an encoder and usually require supervised modeling via labelled data which can be costly to obtain or simply unavailable. In this study, we introduce a novel unsupervised text representation learning technique via instruction-tuning the pre-trained encoder-decoder large language models (LLM) under the dual-encoder retrieval framework. We demonstrate the corpus representation can be augmented by the representations of relevant synthetic queries generated by the instruct-tuned LLM founded on the Rao-Blackwell theorem. Furthermore, we effectively align the query and corpus text representation with self-instructed-tuning. Specifically, we first prompt an open-box pre-trained LLM to follow defined instructions (i.e. question generation and keyword summarization) to generate synthetic queries. Next, we fine-tune the pre-trained LLM with defined instructions and the generated queries that passed quality check. Finally, we generate synthetic queries with the instruction-tuned LLM for each corpora and represent each corpora by weighted averaging the synthetic queries and original corpora embeddings. We evaluate our proposed method under low-resource settings on three English and one German retrieval datasets measuring NDCG@10, MRR@100, Recall@100. We significantly improve the average zero-shot retrieval performance on all metrics, increasing open-box FLAN-T5 model variations by [3.34%, 3.50%] in absolute and exceeding three competitive dense retrievers (i.e. mDPR, T-Systems, mBART-Large), with model of size at least 38% smaller, by 1.96%, 4.62%, 9.52% absolute on NDCG@10.

View on arXiv PDF

Similar