CL · Dec 9, 2021

Semantic Search as Extractive Paraphrase Span Detection

arXiv:2112.04886v1 · 6 citations
Originality: Incremental advance
AI Analysis

This paper addresses semantic search for languages with limited annotated data, though the advance is incremental because it adapts an existing extractive QA setup.

The paper tackles semantic search by reframing it as paraphrase span detection, achieving improvements of up to 31.9 percentage points in exact match over baselines on a Finnish corpus.

In this paper, we approach the problem of semantic search by framing the search task as paraphrase span detection, i.e. given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as typically used in extractive question answering. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs including their original document context, we find that our paraphrase span detection model outperforms two strong retrieval baselines (lexical similarity and BERT sentence embeddings) by 31.9pp and 22.4pp respectively in terms of exact match, and by 22.3pp and 12.9pp in terms of token-level F-score. This demonstrates a strong advantage of modelling the task in terms of span retrieval, rather than sentence similarity. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available.
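The two metrics named in the abstract, exact match and token-level F-score, can be illustrated with a minimal sketch in the style commonly used for extractive question answering. The lowercased whitespace tokenisation and the function names `tokenize`, `exact_match`, and `token_f1` are assumptions made for this illustration; the paper's own evaluation code may normalise text differently.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split; the paper's normalisation may differ.
    return text.lower().split()


def exact_match(predicted: str, gold: str) -> bool:
    # True only when the predicted span has exactly the same tokens, in order, as the gold span.
    return tokenize(predicted) == tokenize(gold)


def token_f1(predicted: str, gold: str) -> float:
    # Harmonic mean of token precision and recall, computed on bag-of-token overlap.
    pred_tokens = tokenize(predicted)
    gold_tokens = tokenize(gold)
    if not pred_tokens or not gold_tokens:
        # Two empty spans agree fully; one empty span scores zero.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a predicted span that contains the whole gold paraphrase plus one extra token fails exact match but still earns partial credit from the F-score, which is why the abstract reports both numbers.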

Code Implementations: 1 repo