CL · Nov 13, 2023

Developing a Named Entity Recognition Dataset for Tagalog

Cambridge
arXiv:2311.07161v1127 citations · h-index: 12
Originality: Synthesis-oriented
AI Analysis

This work addresses a resource gap for Philippine languages and enables future NLP work on Tagalog, though it is incremental in that it applies existing methods to new data.

The researchers tackled the lack of Named Entity Recognition resources for Tagalog by creating a dataset with ~7.8k documents across three entity types, achieving an inter-annotator agreement of 0.81.

We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap present in Philippine languages today, where NER resources are scarce. The texts were obtained from a pretraining corpus containing news reports and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents across three entity types: Person, Organization, and Location. The inter-annotator agreement, as measured by Cohen's $κ$, is 0.81. We also conducted extensive empirical evaluation of state-of-the-art methods across supervised and transfer learning settings. Finally, we released the data and processing code publicly to inspire future work on Tagalog NLP.
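For readers unfamiliar with the agreement metric, the following is a minimal sketch of how Cohen's $κ$ can be computed between two annotators. It assumes agreement is measured over aligned per-token labels; the abstract does not state the unit of comparison, and the function name `cohens_kappa` and the toy labels are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same items")
    if not labels_a:
        raise ValueError("At least one labeled item is required")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example (made-up labels, NOT taken from the dataset).
# PER, ORG, LOC mirror the three entity types; O marks non-entity tokens.
annotator_1 = ["PER", "PER", "O", "ORG", "LOC", "O", "O", "LOC", "O", "O"]
annotator_2 = ["PER", "O", "O", "ORG", "LOC", "O", "O", "ORG", "O", "O"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}")  # kappa = 0.6875
```

In this toy case the annotators agree on 8 of 10 tokens (observed agreement 0.80), but chance agreement is 0.36 because both label most tokens `O`, so $κ$ is noticeably lower than raw agreement. That correction for chance is why $κ$ is reported rather than simple percent agreement.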

Code Implementations: 1 repo