CL | Jun 16, 2023

RED$^{\rm FM}$: a Filtered and Multilingual Relation Extraction Dataset

Pere-Lluís Huguet Cabot, Simone Tedeschi, Axel-Cyrille Ngonga Ngomo, Roberto Navigli

arXiv:2306.09802v2 | 2.919 citations | h-index: 50 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the lack of comprehensive multilingual datasets for relation extraction, enabling better training and evaluation of systems across languages.

The authors address the scarcity of multilingual relation extraction datasets by introducing two resources: SRED$^{\rm FM}$, an automatically annotated dataset covering 18 languages and 400 relation types with more than 40 million triplets, and RED$^{\rm FM}$, a smaller human-revised dataset for seven languages. They demonstrate the utility of both resources with the mREBEL model.

Relation Extraction (RE) is a task that identifies relationships between entities in a text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly when working with languages other than English. In this paper, we address the above issue and provide two new resources that enable the training and evaluation of multilingual RE systems. First, we present SRED$^{\rm FM}$, an automatically annotated dataset covering 18 languages, 400 relation types, 13 entity types, totaling more than 40 million triplet instances. Second, we propose RED$^{\rm FM}$, a smaller, human-revised dataset for seven languages that allows for the evaluation of multilingual RE systems. To demonstrate the utility of these novel datasets, we experiment with the first end-to-end multilingual RE model, mREBEL, that extracts triplets, including entity types, in multiple languages. We release our resources and model checkpoints at https://www.github.com/babelscape/rebel
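To make the notion of a typed triplet concrete, here is a minimal Python sketch of how a relation triplet with entity types might be represented and how predictions could be scored against gold annotations by exact match. The `Triplet` class, the `micro_prf` helper, the example sentence, and the type labels `PER` and `LOC` are all illustrative assumptions; they are not the dataset's actual schema or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one typed relation triplet."""
    head: str        # surface form of the subject entity
    head_type: str   # entity type of the subject (labels here are assumed)
    relation: str    # relation type
    tail: str        # surface form of the object entity
    tail_type: str   # entity type of the object


def micro_prf(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of triplets."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example for the sentence "Marie Curie was born in Warsaw, Poland."
gold = [
    Triplet("Marie Curie", "PER", "place of birth", "Warsaw", "LOC"),
    Triplet("Warsaw", "LOC", "country", "Poland", "LOC"),
]
predicted = [
    Triplet("Marie Curie", "PER", "place of birth", "Warsaw", "LOC"),
    Triplet("Marie Curie", "PER", "place of death", "Warsaw", "LOC"),
]

precision, recall, f1 = micro_prf(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because the dataclass is frozen, its instances are hashable, so set intersection counts a prediction as correct only when both entities, both entity types, and the relation all match. An evaluation on the released data may use different matching rules.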

