CL · Apr 3, 2023

PALI: A Language Identification Benchmark for Perso-Arabic Scripts

CMU
arXiv:2304.01322v1 · 269 citations · h-index: 33
Originality: Synthesis-oriented
AI Analysis

This paper addresses language identification for communities that write in Perso-Arabic scripts, but its contribution is incremental because it builds on existing supervised methods.

The paper tackles the problem of identifying languages written in Perso-Arabic scripts, especially in low-resource and bilingual contexts. It applies supervised classification techniques and a hierarchical model, and its experiments indicate that both are effective.

The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying various languages using such scripts is crucial to language technologies and challenging in low-resource setups. As such, this paper sheds light on the challenges of detecting languages using Perso-Arabic scripts, especially in bilingual communities where "unconventional" writing is practiced. To address this, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experimental results indicate the effectiveness of our solutions.
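
The hierarchical idea in the abstract can be illustrated with a minimal two-stage sketch. This is not the authors' implementation: the character-bigram naive Bayes classifier, the class names `NgramNB` and `HierarchicalLID`, and the `clusters` mapping are all assumptions made for illustration. A root classifier first predicts either a language or a cluster of easily confused languages, and a specialist classifier trained only on that cluster then makes the final decision.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return character n-grams of a sentence, padded with spaces."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNB:
    """Multinomial naive Bayes over character bigrams (add-one smoothing).

    A stand-in for whichever supervised classifier is actually used.
    """

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        self.priors = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        def score(label):
            c = self.counts[label]
            # +1 reserves probability mass for n-grams never seen in training.
            total = sum(c.values()) + len(self.vocab) + 1
            log_prior = math.log(self.priors[label])
            return log_prior + sum(
                math.log((c[g] + 1) / total) for g in char_ngrams(text)
            )

        return max(self.priors, key=score)


class HierarchicalLID:
    """Two-stage classifier: root over coarse labels, then cluster specialists.

    `clusters` is a hypothetical mapping from a cluster name to the set of
    languages that a flat classifier tends to confuse with one another.
    Cluster names must not collide with language labels.
    """

    def __init__(self, clusters):
        self.clusters = clusters
        self.lang_to_cluster = {
            lang: name for name, langs in clusters.items() for lang in langs
        }

    def fit(self, texts, labels):
        # Stage 1: clustered languages are collapsed into their cluster name.
        coarse = [self.lang_to_cluster.get(lab, lab) for lab in labels]
        self.root = NgramNB().fit(texts, coarse)
        # Stage 2: one specialist per cluster, trained only on its languages.
        self.specialists = {}
        for name, langs in self.clusters.items():
            pairs = [(t, lab) for t, lab in zip(texts, labels) if lab in langs]
            if pairs:
                sub_texts, sub_labels = zip(*pairs)
                self.specialists[name] = NgramNB().fit(sub_texts, sub_labels)
        return self

    def predict(self, text):
        top = self.root.predict(text)
        specialist = self.specialists.get(top)
        return specialist.predict(text) if specialist else top
```

In practice the clusters would be read off the confusion matrix of a flat classifier on held-out data, so that each specialist only has to separate languages the root model cannot reliably tell apart.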

Code Implementations: 1 repo
