CL Dec 20, 2022

Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

Microsoft
arXiv:2212.10168v2237 citations
h-index: 41
Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the limited NER resources available for Indic languages, which benefits NLP researchers and practitioners. The contribution is incremental in that it builds on existing methods for dataset creation.

The authors address the lack of large-scale named entity recognition datasets for Indian languages by creating Naamapadam, a dataset covering 11 languages, with more than 400k sentences and at least 100k entities for 9 of them. They demonstrate its utility with IndicNER, which achieves F1 scores above 80 on 7 of the 9 test languages.

We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language translation. We also create manually annotated test sets for 9 languages. We demonstrate the utility of the obtained dataset on the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of more than 80 for 7 out of 9 test languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam.
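
The abstract describes projecting automatically tagged English entities onto the Indian language translation, but does not say how the projection is carried out. The sketch below illustrates one common approach, assuming that token-level word alignments between the two sentences are already available. The function name `project_entities`, the BIO tagging scheme, and the short labels PER/LOC/ORG are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def project_entities(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO entity tags from source tokens onto target tokens.

    src_tags:  BIO tags for the English tokens (assumed scheme).
    alignment: (source_index, target_index) pairs from a word aligner
               (assumed to be given; how they are produced is out of scope).
    tgt_len:   number of tokens in the translated sentence.
    """
    # Collect source entity spans as (start, end_exclusive, label).
    spans = []
    i = 0
    while i < len(src_tags):
        if src_tags[i].startswith("B-"):
            label = src_tags[i][2:]
            j = i + 1
            while j < len(src_tags) and src_tags[j] == f"I-{label}":
                j += 1
            spans.append((i, j, label))
            i = j
        else:
            i += 1

    tgt_tags = ["O"] * tgt_len
    for start, end, label in spans:
        # Target tokens aligned to any token of this source entity.
        tgt_idx = sorted({t for s, t in alignment if start <= s < end})
        if not tgt_idx:
            continue  # entity has no aligned target token; skip it
        lo, hi = tgt_idx[0], tgt_idx[-1]
        # Skip the entity if its projected span collides with an earlier one.
        if any(tag != "O" for tag in tgt_tags[lo:hi + 1]):
            continue
        # Tag the contiguous range covering all aligned target tokens.
        tgt_tags[lo] = f"B-{label}"
        for k in range(lo + 1, hi + 1):
            tgt_tags[k] = f"I-{label}"
    return tgt_tags


# English: "Sachin lives in Mumbai"; translation assumed to have 5 tokens,
# with hand-written alignments for illustration only.
english_tags = ["B-PER", "O", "O", "B-LOC"]
alignments = [(0, 0), (1, 3), (2, 2), (3, 1)]
projected = project_entities(english_tags, alignments, tgt_len=5)
```

Taking the contiguous range between the first and last aligned target token is a simplifying choice: it keeps each projected entity as a single span, at the cost of occasionally including unaligned tokens that fall in between.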

Code Implementations: 1 repo