CL Apr 6, 2022

Urdu Morphology, Orthography and Lexicon Extraction

Muhammad Humayoun, Harald Hammarström, Aarne Ranta

arXiv:2204.03071v11.128 citations h-index: 23

Originality Synthesis-oriented

AI Analysis

The paper provides a computational infrastructure for Urdu language processing. The contribution is incremental, since it applies an existing toolkit to a specific language.

The paper tackles the challenge of Urdu's complex script and morphology by implementing a software API for orthography, morphology, and lexicon extraction. It uses the Functional Morphology toolkit so that the implementation can be reused in applications such as intelligent search and language training.

Urdu is a challenging language for two reasons: first, its Perso-Arabic script, and second, its morphological system, which incorporates grammatical forms and vocabulary of Arabic, Persian and the native languages of South Asia. This paper describes an implementation of the Urdu language as a software API, covering orthography, morphology and the extraction of the lexicon. The morphology is implemented in a toolkit called Functional Morphology (Forsberg & Ranta, 2004), which is based on the idea of treating grammars as software libraries. This implementation can therefore be reused in applications such as intelligent keyword search, language training and infrastructure for syntax. We also present an implementation of a small part of Urdu syntax to demonstrate this reusability.
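To make the "grammars as software libraries" idea concrete, here is a minimal sketch of a paradigm function: an ordinary function that maps a lemma to its full inflection table, which other code can then import and reuse. It is written in Python for illustration and is not the paper's code or the Functional Morphology API. The names `masc_a_paradigm` and `build_analyzer` are hypothetical, and the example uses a simplified romanization (diacritics and nasalization marks dropped) of one common noun class, masculine nouns ending in -a, rather than Perso-Arabic script.

```python
from itertools import product

CASES = ("direct", "oblique", "vocative")
NUMBERS = ("singular", "plural")

# Illustrative endings for masculine nouns whose lemma ends in -a.
# Simplified romanization; an assumption for this sketch, not taken from the paper.
MASC_A_ENDINGS = {
    ("direct", "singular"): "a",
    ("direct", "plural"): "e",
    ("oblique", "singular"): "e",
    ("oblique", "plural"): "on",
    ("vocative", "singular"): "e",
    ("vocative", "plural"): "o",
}


def masc_a_paradigm(lemma):
    """Return the inflection table {(case, number): form} for one lemma."""
    if not lemma.endswith("a"):
        raise ValueError(f"{lemma!r} does not belong to the -a paradigm")
    stem = lemma[:-1]  # drop the final -a, then attach each ending
    return {
        (case, number): stem + MASC_A_ENDINGS[(case, number)]
        for case, number in product(CASES, NUMBERS)
    }


def build_analyzer(lemmas):
    """Invert the paradigm tables into a lookup: word form -> list of analyses."""
    index = {}
    for lemma in lemmas:
        for (case, number), form in masc_a_paradigm(lemma).items():
            index.setdefault(form, []).append((lemma, case, number))
    return index


table = masc_a_paradigm("larka")      # "larka" = boy
analyzer = build_analyzer(["larka"])
print(table[("oblique", "plural")])   # larkon
print(analyzer["larke"])              # three analyses share this form
```

Because the paradigm is a plain function, the same definition serves both generation (the table) and analysis (the inverted index), which is the kind of reuse the abstract points to for keyword search and language training.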
