CL, Dec 14, 2022

Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

arXiv:2212.07429v1
h-index: 18
Originality: Synthesis-oriented
AI Analysis

The paper addresses the need for annotated datasets in low-resource languages for NLP applications, but its contribution is incremental because it follows an established framework.

The paper tackles the problem of creating multilingual datasets for named entity recognition in low-resource languages. It presents the UNER dataset, a hierarchical parallel corpus built from Wikipedia and DBpedia, and adds a post-processing step that increases the number of identified entities.

With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
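
The three-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that entities are taken from wikilinks in article markup, that the DBpedia lookup can be stood in for by a small in-memory dictionary, and that the class-to-label table is ordered from most to least specific. The label strings, the table contents, and all function names are hypothetical placeholders rather than the actual UNER hierarchy.

```python
import re

# Hypothetical stand-in for a DBpedia lookup: article title -> set of ontology classes.
DBPEDIA_CLASSES = {
    "Lisbon": {"dbo:Place", "dbo:PopulatedPlace", "dbo:City"},
    "Fernando Pessoa": {"dbo:Person", "dbo:Agent", "dbo:Writer"},
}

# Hypothetical mapping from DBpedia classes to hierarchical labels,
# listed from most specific to least specific. Label names are placeholders.
CLASS_TO_LABEL = [
    ("dbo:City", "Location-GPE-City"),
    ("dbo:Place", "Location"),
    ("dbo:Person", "Name-Person"),
]

# Matches [[Target]] and [[Target|surface text]] wikilinks.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_entities(wikitext):
    """Step 1: strip wikilink markup, returning plain text and (start, end, target) spans."""
    pieces, spans = [], []
    pos = 0      # read position in the marked-up text
    offset = 0   # write position in the plain text
    for m in WIKILINK.finditer(wikitext):
        before = wikitext[pos:m.start()]
        pieces.append(before)
        offset += len(before)
        target = m.group(1).strip()
        surface = m.group(2) or m.group(1)
        spans.append((offset, offset + len(surface), target))
        pieces.append(surface)
        offset += len(surface)
        pos = m.end()
    pieces.append(wikitext[pos:])
    return "".join(pieces), spans


def map_label(classes):
    """Step 3: return the label of the first (most specific) matching class, or None."""
    for cls, label in CLASS_TO_LABEL:
        if cls in classes:
            return label
    return None


def annotate(wikitext):
    """Run all three steps and return the plain text with (mention, label) pairs."""
    text, spans = extract_entities(wikitext)
    annotations = []
    for start, end, target in spans:
        classes = DBPEDIA_CLASSES.get(target, set())  # Step 2: link to DBpedia
        label = map_label(classes)
        if label is not None:
            annotations.append((text[start:end], label))
    return text, annotations


text, annotations = annotate("[[Fernando Pessoa|Pessoa]] was born in [[Lisbon]].")
```

In this sketch, mentions whose link target has no known DBpedia classes are simply left unlabelled. The post-processing step mentioned in the abstract is not modelled here.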

