Summarization-Based Document IDs for Generative Retrieval with Language Models
This work addresses the challenge of creating effective document IDs for generative retrieval systems. It offers a novel approach that improves retrieval performance, though the contribution is incremental because it builds on existing generative retrieval frameworks.
The paper tackles the problem of generating document identifiers for generative retrieval by introducing summarization-based IDs, such as abstractive keyphrases, which improve top-10 and top-20 recall by up to 15.6% and 14.4% (relative) over the cluster-based integer ID baseline on the MSMARCO 100k retrieval task.
Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a popular approach for end-to-end document retrieval that directly generates document identifiers given an input query. We introduce summarization-based document IDs, in which each document's ID is composed of an extractive summary or abstractive keyphrases generated by a language model, rather than an integer ID sequence or bags of n-grams as proposed in past work. We find that abstractive, content-based IDs (ACID) and an ID based on the first 30 tokens are very effective in direct comparisons with previous approaches to ID creation. We show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative) respectively versus the cluster-based integer ID baseline on the MSMARCO 100k retrieval task, and 9.8% and 9.9% respectively on the Wikipedia-based NQ 100k retrieval task. Our results demonstrate the effectiveness of human-readable, natural-language IDs created through summarization for generative retrieval. We also observe that extractive IDs outperform abstractive IDs on the Wikipedia articles in NQ but not on the snippets in MSMARCO, which suggests that document characteristics affect generative retrieval performance.
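To make the reported quantities concrete, the following is a minimal sketch of a first-30-token ID, top-k recall, and relative improvement. It is not the paper's implementation: the function names are hypothetical, whitespace splitting stands in for whatever tokenizer the authors used, and the recall values in the usage lines are made up for illustration rather than taken from the paper.

```python
from typing import Sequence


def first_k_tokens_id(document: str, k: int = 30) -> str:
    """Build an ID from the first k tokens of a document.

    Assumption: tokens are whitespace-separated words; the paper's
    tokenizer may differ.
    """
    return " ".join(document.split()[:k])


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                relevant_ids: Sequence[str],
                k: int) -> float:
    """Fraction of queries whose relevant ID appears in the top-k generated IDs.

    Assumption: each query has exactly one relevant document.
    """
    hits = sum(
        1 for ranked, relevant in zip(ranked_ids, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_ids)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Illustrative usage with made-up values (not results from the paper).
doc_id = first_k_tokens_id("generative retrieval maps a query to a document id", k=3)
recall = recall_at_k(
    ranked_ids=[["id a", "id b"], ["id c", "id d"]],
    relevant_ids=["id b", "id x"],
    k=2,
)
gain = relative_improvement(new=0.55, baseline=0.50)
```

A relative improvement divides the absolute recall difference by the baseline recall, so the same absolute gain reads as a larger percentage when the baseline is lower.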