MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition
MariNER addresses a gap in resources for digital humanities researchers analyzing historical Brazilian Portuguese texts, though the contribution is incremental: it applies existing methods to new data.
The paper tackles the lack of gold-standard Named Entity Recognition (NER) datasets for historical Brazilian Portuguese by constructing MariNER, the first such dataset, with more than 9,000 manually annotated sentences from early 20th-century texts, and evaluates state-of-the-art NER models on it.
Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify entity mentions in text and classify them into categories. While languages such as English have many high-quality resources for this task, Brazilian Portuguese still has few gold-standard NER datasets, especially for specific domains. In particular, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: \textit{Mapeamento e Anotações de Registros hIstóricos para NER} (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of state-of-the-art NER models on the dataset.
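To make the task and its evaluation concrete, the following is a minimal sketch of how NER output is commonly scored at the entity level. It rests on several assumptions not stated in the abstract: that annotations use a BIO tagging scheme, that models are compared by exact-match precision, recall, and F1, and that labels such as `PER` and `LOC` exist. The example sentence and the helper names `bio_to_spans` and `entity_f1` are hypothetical and are not taken from MariNER.

```python
from typing import List, Set, Tuple

# A span is (start_index, end_index_exclusive, label).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of labeled entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
            start, label = None, None
            # "B-X" opens an entity; a stray "I-X" is treated leniently as an opening.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1 for one sentence."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example sentence (not from the dataset); labels are assumed.
tokens = ["Rui", "Barbosa", "chegou", "ao", "Rio", "de", "Janeiro", "."]
gold = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O"]

precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)
```

In this sketch the predicted location covers only "Rio", so it does not count as a match for "Rio de Janeiro". Exact-match scoring is strict about span boundaries, which matters for multi-token names of the kind that appear in historical text.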