CL | AI | Aug 16, 2022

ConTextual Masked Auto-Encoder for Dense Passage Retrieval

arXiv:2208.07670v3 | 34 citations | h-index: 32 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of retrieving the passages relevant to a query from a large corpus. It is an incremental advance because it builds on existing pre-trained language models for dense retrieval.

The paper tackles the problem of dense passage retrieval by proposing CoT-MAE, a generative pre-training method that improves retrieval performance through self-supervised and context-supervised masked auto-encoding, showing considerable improvements over strong baselines on large-scale benchmarks.

Dense passage retrieval aims to retrieve the relevant passages of a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress the sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Precisely, self-supervised masked auto-encoding learns to model the semantics of the tokens inside a text span, and context-supervised masked auto-encoding learns to model the semantical correlation between the text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the high efficiency of CoT-MAE. Our code is available at https://github.com/caskcsg/ir/tree/main/cotmae.
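To make the retrieval setting in the abstract concrete, the following is a minimal sketch of how passages can be ranked once the query and passages have been encoded as dense vectors. It does not reproduce CoT-MAE or the authors' code: the function names `dot` and `retrieve`, the toy 3-dimensional vectors, and the choice of inner product as the similarity score are all assumptions made for illustration. In practice the vectors would come from a trained encoder.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, passage_vecs, k=2):
    """Rank passages by inner product with the query and return the top k.

    Returns a list of (passage_index, score) pairs, best first.
    Inner-product scoring is an assumption here; the abstract does not
    name a similarity function.
    """
    scores = [dot(query_vec, p) for p in passage_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Hypothetical toy embeddings standing in for encoder outputs.
passages = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [0.9, 0.1, 0.0]

results = retrieve(query, passages, k=2)
for index, score in results:
    print(f"passage {index}: score {score:.2f}")
```

Because scoring reduces to a vector comparison, passage vectors can be computed once offline and only the query has to be encoded at search time. This is why the quality of the single dense vector, which the pre-training described above aims to improve, matters so much for retrieval.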

Code Implementations: 2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
