CL | AI | May 15, 2024

ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

arXiv:2405.09496v1 | 24.083 citations | h-index: 16 | Has Code | LREC

Originality: Synthesis-oriented

AI Analysis

ParaNames provides a large-scale, standardized name corpus for multilingual language processing, addressing data scarcity in tasks such as translation and entity recognition across many languages.

The authors created ParaNames, a massively multilingual parallel name resource with 140 million names for 16.8 million entities across over 400 languages, using Wikidata as a source. They demonstrated its utility by improving performance on canonical name translation and multilingual named entity recognition tasks, with gains on all 10 languages evaluated.

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
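The step of mapping each entity from a complex type hierarchy to a standard type (PER/LOC/ORG) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy hierarchy, the type names, the choice of root types, and the function `standard_type` are hypothetical, and the authors' actual procedure may differ.

```python
from collections import deque

# Toy "subclass of" edges (child -> parents). Hypothetical data for
# illustration only; this is not taken from Wikidata or from ParaNames.
SUBCLASS_OF = {
    "city": ["human settlement"],
    "human settlement": ["location"],
    "university": ["educational institution"],
    "educational institution": ["organization"],
    "human": [],
}

# Assumed roots for the three standard types.
ROOT_TO_TYPE = {"human": "PER", "location": "LOC", "organization": "ORG"}


def standard_type(fine_type):
    """Walk up the hierarchy breadth-first and return the first
    standard type reached, or None if no root is reachable."""
    seen = {fine_type}
    queue = deque([fine_type])
    while queue:
        node = queue.popleft()
        if node in ROOT_TO_TYPE:
            return ROOT_TO_TYPE[node]
        for parent in SUBCLASS_OF.get(node, []):
            if parent not in seen:  # guard against cycles in the hierarchy
                seen.add(parent)
                queue.append(parent)
    return None


if __name__ == "__main__":
    for t in ["city", "university", "human", "river"]:
        print(t, "->", standard_type(t))
```

In this sketch, a type with no path to any assumed root (here `"river"`) maps to `None`; whether such entities are dropped or handled another way in the real resource is not stated in the abstract.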

