CL · Nov 13, 2023

Developing a Named Entity Recognition Dataset for Tagalog

Cambridge

arXiv:2311.07161v1 · 127 citations · h-index: 12

Originality: Synthesis-oriented

AI Analysis

This work addresses a resource gap for Philippine languages and enables future NLP work on Tagalog, though the contribution is incremental: it applies existing methods to new data.

The researchers tackled the lack of Named Entity Recognition resources for Tagalog by creating a dataset with ~7.8k documents across three entity types, achieving an inter-annotator agreement of 0.81.

We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap present in Philippine languages today, where NER resources are scarce. The texts were obtained from a pretraining corpus containing news reports, and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents across three entity types: Person, Organization, and Location. The inter-annotator agreement, as measured by Cohen's $κ$, is 0.81. We also conducted extensive empirical evaluation of state-of-the-art methods across supervised and transfer learning settings. Finally, we released the data and processing code publicly to inspire future work on Tagalog NLP.
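For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's κ can be computed for two annotators, using κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa`, the label strings (`PER`, `ORG`, `LOC`, `O`), and the toy annotations are illustrative assumptions, not taken from the paper. The abstract does not say whether agreement was measured per token or per span, so this sketch simply compares two aligned label sequences.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # Degenerate case: both annotators used one identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels from two annotators (toy data, not from the dataset).
annotator_a = ["PER", "O", "ORG", "O", "LOC", "O", "PER", "O"]
annotator_b = ["PER", "O", "ORG", "O", "O", "O", "PER", "LOC"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Because κ corrects for chance agreement, it is lower than raw percent agreement whenever one label (such as `O` in NER) dominates the annotations.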
