CL, Apr 3, 2023

Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

CMU
arXiv:2304.01319v1, 225 citations, h-index: 33
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of technological tools for endangered and under-represented language communities. Its contribution is incremental, since it applies existing corpus creation methods to new data.

The paper tackles the scarcity of language data for Southern Kurdish and Laki by building corpora from local news websites, a local radio station's broadcasts, and fieldwork. It yields resources for these under-represented languages and a study of language identification across Kurdish variants and the Zaza-Gorani languages.

One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case of the Southern varieties of the Kurdish and Laki languages for which very limited resources are available with insubstantial progress in tools. To tackle this, we provide a few approaches that rely on the content of local news websites, a local radio station that broadcasts content in Southern Kurdish and fieldwork for Laki. In this paper, we describe some of the challenges of such under-represented languages, particularly in writing and standardization, and also, in retrieving sources of data and retro-digitizing handwritten content to create a corpus for Southern Kurdish and Laki. In addition, we study the task of language identification in light of the other variants of Kurdish and Zaza-Gorani languages.

Code Implementations: 1 repo
