CL
Sep 25, 2019

Developing a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish

arXiv:1909.11467v1
1092 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the limited language processing resources available to Kurdish speakers and researchers. The contribution is incremental, as it builds on existing corpus creation methods.

The authors tackled the lack of corpora for Kurdish, a less-resourced language, by creating KTC, a fine-grained corpus from 31 K-12 textbooks in the Sorani dialect, resulting in 693,800 tokens categorized into 12 educational subjects.

Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC, the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in the Sorani dialect. The corpus is normalized and categorized into 12 educational subjects containing 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.
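
The token and type figures in the abstract can be read as follows: tokens are all word occurrences, types are the distinct word forms. The sketch below is a minimal illustration of that distinction, assuming naive whitespace tokenization. The function name `token_type_counts` and the sample string are hypothetical, and this is not necessarily how the authors tokenized or normalized KTC.

```python
def token_type_counts(text: str) -> tuple[int, int]:
    """Return (tokens, types) for a text.

    Assumption: tokens are whitespace-separated strings. A real Sorani
    pipeline would likely normalize the script and handle punctuation first.
    """
    tokens = text.split()
    return len(tokens), len(set(tokens))


# Placeholder sample, not taken from the corpus.
n_tokens, n_types = token_type_counts("a b a c b")

# Type-token ratio computed from the figures reported in the abstract.
ktc_ratio = 110_297 / 693_800
print(n_tokens, n_types, round(ktc_ratio, 3))
```

With the reported figures, the type-token ratio comes to roughly 0.159, that is, about one distinct word form for every six running words.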

Code Implementations: 1 repo