CL · Jun 1, 2019

Latent Retrieval for Weakly Supervised Open Domain Question Answering

arXiv:1906.00300v3 · 1419 citations
Originality: Highly original
AI Analysis

This paper addresses weakly supervised QA for users who need accurate answers without strong evidence supervision, though its joint optimization of retrieval and reading is incremental.

The paper tackles open domain question answering by jointly learning a retriever and a reader from question-answer pairs alone, without relying on gold evidence or a blackbox IR system. On datasets where users genuinely seek answers, it improves exact match over BM25 by up to 19 points.

Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.
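The abstract names the Inverse Cloze Task as the retriever's pre-training objective but does not spell out how training pairs are built. As a rough sketch only: one sentence from a passage is treated as a pseudo-question and the rest of the passage as its pseudo-evidence, so the retriever learns to match a sentence to the context it came from. The function names, the sentence-list input, and the `keep_prob` value are illustrative assumptions, not details taken from the text above.

```python
import random


def make_ict_example(sentences, target_index, keep_target=False):
    """Build one (pseudo-question, pseudo-evidence) pair from a passage.

    `sentences` is a passage already split into sentences (assumed input
    format). The sentence at `target_index` becomes the pseudo-question.
    """
    if not 0 <= target_index < len(sentences):
        raise IndexError("target_index out of range")
    pseudo_question = sentences[target_index]
    if keep_target:
        # Optionally leave the sentence in the context.
        context = list(sentences)
    else:
        # Otherwise remove it, so matching cannot rely on exact overlap.
        context = sentences[:target_index] + sentences[target_index + 1:]
    return pseudo_question, " ".join(context)


def sample_ict_example(sentences, rng, keep_prob=0.1):
    """Randomly pick the target sentence; keep_prob is an assumed setting."""
    target_index = rng.randrange(len(sentences))
    keep_target = rng.random() < keep_prob
    return make_ict_example(sentences, target_index, keep_target)


passage = [
    "Zebras have four gaits.",
    "They are generally slower than horses.",
    "Their stamina helps them outrun predators.",
]
question, context = make_ict_example(passage, 1)
random_question, random_context = sample_ict_example(passage, random.Random(0))
```

Here `question` is the second sentence and `context` is the passage with that sentence removed. In actual pre-training, pairs like these would feed a contrastive objective over many candidate contexts, which this sketch leaves out.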

Code Implementations: 3 repos