IR · CL · May 23, 2023

DAPR: A Benchmark on Document-Aware Passage Retrieval

arXiv:2305.13915v4 · 31 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific retrieval challenge: finding relevant passages within long documents such as Wikipedia articles or research papers. Its contribution is incremental, since it builds on existing methods and benchmarks.

The authors tackled the problem of retrieving relevant passages within long documents, identifying that 53.5% of errors in state-of-the-art retrievers stem from missing document context. They proposed the Document-Aware Passage Retrieval (DAPR) task, built a benchmark with multiple datasets, and found that hybrid retrieval fails on hard queries while contextualized representations show limited overall improvement.

The work of neural retrieval so far focuses on ranking short texts and is challenged with long documents. There are many cases where the users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles, research papers, etc. We propose and name this task Document-Aware Passage Retrieval (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. This drives us to build a benchmark for this task including multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find despite that hybrid retrieval performs the strongest on the mixture of the easy and the hard queries, it completely fails on the hard queries that require document-context understanding. On the other hand, contextualized passage representations (e.g. prepending document titles) achieve good improvement on these hard queries, but overall they also perform rather poorly. Our created benchmark enables future research on developing and comparing retrieval systems for the new task. The code and the data are available at https://github.com/UKPLab/arxiv2023-dapr.
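
The two ways of adding document context named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Passage` type, the helper names, the " | " separator, the min-max normalization, and the equal-weight convex combination are all assumptions chosen for illustration. The passage and document scores are assumed to be precomputed by some passage retriever and by a document-level BM25 run.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """Hypothetical container for one passage and its parent document."""
    doc_id: str
    passage_id: str
    title: str
    text: str


def prepend_title(passage: Passage, sep: str = " | ") -> str:
    """Contextualized representation: put the document title before the passage.

    The separator is an assumption; the returned string is what would be
    fed to the passage encoder instead of the bare passage text.
    """
    return f"{passage.title}{sep}{passage.text}"


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(
    passages: list[Passage],
    passage_scores: dict[str, float],  # passage_id -> passage retriever score
    doc_scores: dict[str, float],      # doc_id -> document-level BM25 score
    weight: float = 0.5,               # assumed fusion weight
) -> dict[str, float]:
    """Hybrid retrieval: fuse each passage score with its document's score.

    Both score sets are normalized first so they are on a comparable scale,
    then combined as weight * passage + (1 - weight) * document.
    """
    p_norm = min_max(passage_scores)
    d_norm = min_max(doc_scores)
    return {
        p.passage_id: weight * p_norm.get(p.passage_id, 0.0)
        + (1.0 - weight) * d_norm.get(p.doc_id, 0.0)
        for p in passages
    }
```

In this sketch, two passages with identical text but different parent documents receive the same passage score, so only the document-level score (or the prepended title) can separate them. That is the kind of query the abstract describes as requiring document context.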

Code Implementations: 3 repos