CL · Jan 28, 2022

Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages

Silvia Severini, Ayyoob Imani, Philipp Dufter, Hinrich Schütze

arXiv:2201.12219


AI Analysis

This work addresses the challenge of extracting named entities for underresourced languages, enabling applications such as knowledge graph augmentation and bilingual lexicon induction.

The paper tackles the problem of creating a multilingual named entity resource for many languages, especially underresourced ones. It introduces CLC-BN, a method that learns a neural transliteration model from parallel-corpus statistics without requiring any other bilingual resources, word aligners, or seed data. Applied to the Parallel Bible Corpus, a corpus of more than 1000 languages, CLC-BN outperforms prior work, and the authors release a resource for 1340 languages.

Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.
