CL | Apr 30, 2024

HistNERo: Historical Named Entity Recognition for the Romanian Language

arXiv:2405.00155v1 | 5 citations | h-index: 13 | ICDAR
Originality: Synthesis-oriented
AI Analysis

This paper addresses historical text analysis for researchers working on the Romanian language, but its contribution is incremental: it applies existing methods to a new dataset.

This work tackles Named Entity Recognition (NER) for historical Romanian newspapers by introducing HistNERo, the first Romanian corpus for this task. It reports a strict F1-score of 66.80% with a novel domain adaptation technique, an absolute gain of more than 10 percentage points over the best model without adaptation.

This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text and spans the period from 1817, covering more than half of the 19th century, to 1990, in the late 20th century. Eight native Romanian speakers annotated the dataset with five named entity types. Each sample comes from one of four historical regions of Romania: Bessarabia, Moldavia, Transylvania, or Wallachia. We used the dataset in several NER experiments with Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. By reducing the discrepancies between regions through a novel domain adaptation technique, we improved the performance on this corpus to a strict F1-score of 66.80%, an absolute gain of 11.11 percentage points.
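For readers unfamiliar with the metric, strict F1 is commonly computed at the entity level: a predicted entity counts as correct only if both its boundaries and its type match a gold entity exactly. The following is a minimal sketch under stated assumptions, not the paper's evaluation script. It assumes BIO-style tags, and the helper names `bio_to_spans` and `strict_f1` as well as the `PER` and `LOC` labels are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Set, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Assumption: an I- tag with no matching open entity is ignored.
    return spans


def strict_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Entity-level F1 where only exact boundary and type matches count."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)  # exact (start, end, type) matches
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example: the person entity is predicted with the wrong right boundary,
# so under strict matching it counts as both a miss and a false positive.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
score = strict_f1(gold, pred)
```

In the toy example only the location entity matches exactly, so precision and recall are both one half. A partially overlapping prediction earns no credit under this definition, which is part of why strict scores on noisy historical text tend to be lower than relaxed ones.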

