IR CL LG

Sep 10, 2021

Query-driven Segment Selection for Ranking Long Documents

Youngwoo Kim, Razieh Rahimi, Hamed Bonab, James Allan

arXiv:2109.04611v1

5 citations

Originality: Incremental advance

AI Analysis

This paper addresses the inefficiency of ranking long documents in information retrieval, offering an incremental improvement over existing approaches.

The paper tackles the problem of training transformer-based rankers on long documents. It replaces heuristic segment selection with a query-driven segment selection method. The resulting BERT-based ranker significantly outperforms one trained on heuristically selected segments and matches a state-of-the-art model with localized self-attention.

Transformer-based rankers have shown state-of-the-art performance. However, their self-attention operation is mostly unable to process long sequences. One common approach to training these rankers is to heuristically select some segments of each document, such as the first segment, as training data. However, these segments may not contain the query-related parts of documents. To address this problem, we propose query-driven segment selection from long documents to build training data. The segment selector provides relevant samples with more accurate labels and non-relevant samples that are harder to predict. The experimental results show that the basic BERT-based ranker trained with the proposed segment selector significantly outperforms one trained on heuristically selected segments, and performs on par with the state-of-the-art model with localized self-attention that can process longer input sequences. Our findings open up a new direction for designing efficient transformer-based rankers.
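The contrast between heuristic and query-driven selection can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how segments are scored, so the term-overlap scorer below is a stand-in assumption, and the names `split_segments`, `overlap_score`, `select_segment`, and `first_segment`, the whitespace tokenization, and the segment length of 8 are all hypothetical. The sketch also assumes the selector keeps the single highest-scoring segment of each document.

```python
from typing import Callable, List


def split_segments(tokens: List[str], seg_len: int) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping segments."""
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]


def overlap_score(query_tokens: List[str], segment_tokens: List[str]) -> float:
    """Stand-in scorer (an assumption, not the paper's selector):
    fraction of distinct query terms that appear in the segment."""
    query_terms = set(query_tokens)
    if not query_terms:
        return 0.0
    return len(query_terms & set(segment_tokens)) / len(query_terms)


def select_segment(
    query: str,
    document: str,
    seg_len: int = 8,
    scorer: Callable[[List[str], List[str]], float] = overlap_score,
) -> List[str]:
    """Query-driven selection: return the segment the scorer rates highest."""
    query_tokens = query.lower().split()
    segments = split_segments(document.lower().split(), seg_len)
    if not segments:
        return []
    # max() keeps the earliest segment when scores tie.
    return max(segments, key=lambda seg: scorer(query_tokens, seg))


def first_segment(document: str, seg_len: int = 8) -> List[str]:
    """Heuristic baseline: always take the first segment."""
    return document.lower().split()[:seg_len]


if __name__ == "__main__":
    doc = (
        "the museum opens at nine every day and "
        "bert rankers truncate long documents because attention is costly"
    )
    query = "why do bert rankers truncate documents"
    print("first segment: ", " ".join(first_segment(doc)))
    print("query-driven:  ", " ".join(select_segment(query, doc)))
```

Under these assumptions, the same selection applies to both kinds of training samples: for a relevant document it picks a segment more likely to justify the positive label, and for a non-relevant document it picks the segment that looks most like a match, which gives a harder negative. Any scorer with the same signature, such as a trained ranker, could be passed in place of `overlap_score`.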
