OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
This work addresses the need for accessible, uniform NER datasets for research in multilingual and multi-ontology settings. Its contribution is incremental: it compiles and standardizes existing resources.
The authors address the problem of inconsistent and scattered named entity recognition (NER) datasets by creating OpenNER 1.0, a standardized collection of 36 corpora across 52 languages. In their baseline experiments, no single model performed best across all languages, and significant work remains before LLMs reach high performance on the task.
We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated under varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task.
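To make the idea of "consistent entity type names across corpora" concrete, the following is a minimal sketch of one way such a normalization could work on BIO-tagged labels. It is not the authors' pipeline: the `TYPE_MAP` entries, the function names, and the choice of BIO tagging are all assumptions made for illustration.

```python
# Hypothetical mapping from corpus-specific entity type names to shared names.
# These entries are illustrative assumptions, not OpenNER's actual mapping.
TYPE_MAP = {
    "PERSON": "PER",
    "PER": "PER",
    "LOCATION": "LOC",
    "LOC": "LOC",
    "ORGANIZATION": "ORG",
    "ORG": "ORG",
}


def normalize_label(label: str) -> str:
    """Rewrite one BIO label, e.g. 'B-PERSON' -> 'B-PER'; 'O' is unchanged."""
    if label == "O":
        return label
    prefix, sep, entity_type = label.partition("-")
    if sep == "" or prefix not in {"B", "I"}:
        raise ValueError(f"not a BIO label: {label!r}")
    # Types missing from the mapping are kept as-is rather than guessed.
    shared_type = TYPE_MAP.get(entity_type.upper(), entity_type)
    return f"{prefix}-{shared_type}"


def normalize_sentence(labels: list[str]) -> list[str]:
    """Normalize every label of one sentence, preserving order and length."""
    return [normalize_label(label) for label in labels]
```

For example, `normalize_sentence(["B-PERSON", "I-PERSON", "O", "B-Location"])` maps the two corpus-specific type names onto the shared `PER` and `LOC` names. Keeping unmapped types unchanged is a deliberate choice in this sketch, since corpora with different ontologies may contain types that have no shared equivalent.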