IR | CL | Oct 26, 2024

UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers

arXiv:2410.20163v2 | 15 citations | h-index: 13 | NAACL
Originality: Highly original
AI Analysis

The paper addresses a limitation of homogeneous retrieval models: they do not fit real-world applications that involve diverse data types. The approach is assessed as a novel method rather than an incremental improvement.

The paper tackles retrieval over heterogeneous knowledge sources by introducing UniHGKR, a unified instruction-aware retriever that builds a single retrieval space and follows diverse user instructions. It reports relative improvements of up to 6.36% and 54.23% in the two scenarios of the new CompMix-IR benchmark, and an absolute improvement of up to 5.90 points on the ConvMix task.

Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. This framework is highly scalable, with a BERT-based version and a UniHGKR-7B version trained on large language models. Also, we introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries, covering four different types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in two scenarios, respectively. Finally, by equipping our retriever for open-domain heterogeneous QA systems, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 5.90 points.
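
The two ideas in the abstract, a single retrieval space for heterogeneous entries and instructions that steer retrieval toward a requested type, can be illustrated with a toy sketch. This is not the UniHGKR model: it swaps the trained encoder for a bag-of-words vector, and the corpus entries, the type names (`text`, `table`, `infobox`, `kb`), the `[type]` prefix, and the instruction wording are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; type labels and contents are illustrative only.
CORPUS = [
    {"type": "text",
     "content": "Inception is a 2010 science fiction film directed by Christopher Nolan."},
    {"type": "table",
     "content": "film: Inception | year: 2010 | director: Christopher Nolan"},
    {"type": "infobox",
     "content": "Inception | director | Christopher Nolan | release year | 2010"},
    {"type": "kb",
     "content": "(Inception, director, Christopher Nolan)"},
]


def linearize(entry):
    """Flatten any entry to one string so all types share one space."""
    return f"[{entry['type']}] {entry['content']}"


def embed(text):
    """Bag-of-words stand-in for a learned encoder (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, instruction, corpus, k=1):
    """Prepend the instruction to the query, then rank every entry by similarity."""
    query_vec = embed(f"{instruction} {query}")
    ranked = sorted(
        corpus,
        key=lambda entry: cosine(query_vec, embed(linearize(entry))),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    question = "Who directed Inception?"
    for wanted in ("table", "kb"):
        top = retrieve(question, f"Retrieve {wanted} evidence.", CORPUS)[0]
        print(wanted, "->", linearize(top))
```

In this toy setup the instruction only influences ranking through word overlap with the `[type]` prefix. A trained retriever such as the one the abstract describes would instead learn that behavior during instruction-aware fine-tuning.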

Code Implementations: 1 repo