CL | Dec 4, 2020

Event Guided Denoising for Multilingual Relation Learning

Amith Ananthram, Emily Allaway, Kathleen McKeown

arXiv:2012.02721

Originality: Incremental advance

AI Analysis

This work provides a more data-efficient method for training multilingual relation extraction models, which is significant for researchers and practitioners in NLP who face high data collection costs.

This paper addresses the high data cost of distant supervision for relation extraction by proposing a method for collecting high-quality training data from unlabeled text. The approach achieves zero-shot and few-shot results comparable to those of a state-of-the-art method while using far fewer training examples (50k vs. 300 million+).

General-purpose relation extraction has recently seen considerable gains, in part due to a massively data-intensive distant supervision technique from Soares et al. (2019) that produces state-of-the-art results across many benchmarks. In this work, we present a methodology for collecting high-quality training data for relation extraction from unlabeled text that achieves a near-recreation of their zero-shot and few-shot results at a fraction of the training cost. Our approach exploits the predictable distributional structure of date-marked news articles to build a denoised corpus: the extraction process filters out low-quality examples. We show that a smaller multilingual encoder trained on this corpus performs comparably to the current state of the art (when both receive little to no fine-tuning) on few-shot and standard relation benchmarks in English and Spanish, despite using many fewer examples (50k vs. 300 million+).
