CL, Sep 25, 2019

Developing a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish

Roshna Omer Abdulrahman, Hossein Hassani, Sina Ahmadi

arXiv:1909.11467v1

Originality: Synthesis-oriented

AI Analysis

This work addresses the shortage of language processing resources for Kurdish speakers and researchers, though its contribution is incremental because it builds on existing corpus creation methods.

The authors tackled the lack of corpora for Kurdish, a less-resourced language, by creating KTC, a fine-grained corpus from 31 K-12 textbooks in the Sorani dialect, resulting in 693,800 tokens categorized into 12 educational subjects.

Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC, the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in the Sorani dialect. The corpus is normalized and categorized into 12 educational subjects, and it contains 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.
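For readers unfamiliar with the token/type distinction used in the abstract, the following is a minimal sketch of how such counts can be computed. It assumes naive whitespace tokenization, which is not necessarily how the authors tokenized or normalized KTC; the function name `token_type_counts` and the sample string are hypothetical and serve only as an illustration.

```python
from collections import Counter


def token_type_counts(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are whitespace-separated strings. A real corpus
    pipeline would normally normalize the text and handle punctuation first.
    """
    tokens = text.split()          # every running word counts as a token
    types = Counter(tokens)        # distinct word forms are the types
    return len(tokens), len(types)


# Hypothetical toy input, not taken from the corpus.
sample = "a b a c b a"
n_tokens, n_types = token_type_counts(sample)
print(n_tokens, n_types)  # 6 3

# Type-token ratio implied by the figures reported in the abstract.
ttr = 110_297 / 693_800
print(round(ttr, 3))  # 0.159
```

With the reported figures, roughly one in six tokens introduces a new word form, although the ratio depends heavily on corpus size and on the tokenization actually used.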

