ParaNames: A Massively Multilingual Entity Name Corpus
ParaNames provides a large-scale resource for multilingual language processing tasks such as named entity recognition and linking, though the contribution is incremental because it builds on existing Wikidata data.
The authors address the need for multilingual entity name resources by creating ParaNames, a massively multilingual parallel name corpus with 118 million names across 400 languages for 13.6 million entities, and demonstrate its utility by training a model for canonical name translation.
We introduce ParaNames, a multilingual parallel name resource consisting of 118 million names spanning 400 languages. Names are provided for 13.6 million entities, which are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English. Our resource is released under a Creative Commons license (CC BY 4.0) at https://github.com/bltlab/paranames.
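To make the notion of a parallel name resource concrete, the following is a minimal sketch of how per-entity names might be grouped and turned into training pairs for translation into English. The record layout (`NameRecord` and its fields), the identifiers `E1` and `E2`, and the helper `english_pairs` are hypothetical and chosen for illustration only; they are not the released file format, and the sample records are not drawn from the corpus.

```python
from collections import defaultdict
from typing import NamedTuple


class NameRecord(NamedTuple):
    """Hypothetical record layout: one name of one entity in one language."""
    entity_id: str    # placeholder identifier, not a real Wikidata ID
    language: str     # language code of the name
    name: str         # the entity's name in that language
    entity_type: str  # one of PER, LOC, ORG


def english_pairs(records):
    """Pair each non-English name with the English name of the same entity.

    Entities without an English name are skipped.
    """
    by_entity = defaultdict(dict)
    for rec in records:
        by_entity[rec.entity_id][rec.language] = rec.name

    pairs = []
    for names in by_entity.values():
        english = names.get("en")
        if english is None:
            continue
        for language, name in names.items():
            if language != "en":
                pairs.append((language, name, english))
    return pairs


# Illustrative records only (assumed, not taken from ParaNames).
records = [
    NameRecord("E1", "en", "Paris", "LOC"),
    NameRecord("E1", "ru", "Париж", "LOC"),
    NameRecord("E1", "ja", "パリ", "LOC"),
    NameRecord("E2", "fi", "Helsinki", "LOC"),  # no English name: skipped
]
pairs = english_pairs(records)
```

Each resulting tuple holds a source language, a source name, and the English target; swapping the last two elements would give pairs for the reverse direction, from English.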