CLIR · Apr 7, 2025

Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

arXiv:2504.08792v1 · 12 citations · h-index: 39 · Proceedings of the Tenth Workshop on Noisy and User-generated Text
Originality: Incremental advance
AI Analysis

The paper addresses the lack of annotated datasets for low-resource languages, an incremental advance in NLP for specific linguistic communities.

The paper tackles Named Entity Recognition (NER) in low-resource Pakistani languages by proposing a cross-lingual data augmentation technique, which yields significant performance improvements for Shahmukhi and Pashto.

Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has advanced significantly for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences, and we experiment on four low-resource Pakistani languages: Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.
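The abstract does not spell out how the augmented sentences are built. As a rough illustration of the general idea of NER data augmentation, the sketch below swaps each labelled entity in a BIO-tagged sentence for another entity of the same type. This is a generic entity-replacement technique and an assumption on our part, not the authors' method; the function name `augment_by_entity_swap`, the `entity_pool` structure, and the sample sentence are all hypothetical.

```python
import random


def augment_by_entity_swap(tokens, tags, entity_pool, rng):
    """Return a new (tokens, tags) pair with each BIO entity span replaced.

    Hypothetical helper: `entity_pool` maps an entity type (e.g. "PER") to a
    list of replacement entities, each given as a list of tokens.
    """
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            ent_type = tag[2:]
            # Extend the span over the following I- tags of the same type.
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{ent_type}":
                j += 1
            # Ignore empty candidates so tokens and tags stay aligned.
            candidates = [c for c in entity_pool.get(ent_type, []) if c]
            # Keep the original span when no replacement is available.
            replacement = rng.choice(candidates) if candidates else tokens[i:j]
            out_tokens.extend(replacement)
            out_tags.extend(
                [f"B-{ent_type}"] + [f"I-{ent_type}"] * (len(replacement) - 1)
            )
            i = j
        else:
            # Non-entity tokens (and stray I- tags) are copied unchanged.
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_tokens, out_tags


# Illustrative data only; a real pool would hold entities in the target language.
sentence = ["Ali", "lives", "in", "Lahore"]
labels = ["B-PER", "O", "O", "B-LOC"]
pool = {"PER": [["Ayesha", "Khan"]], "LOC": [["Peshawar"]]}

new_tokens, new_labels = augment_by_entity_swap(
    sentence, labels, pool, random.Random(0)
)
print(list(zip(new_tokens, new_labels)))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible. Drawing replacements from a pool of names and places common in the target community is one plausible way to keep generated sentences culturally plausible, though the paper itself should be consulted for the actual procedure.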
