CL, May 24, 2022

Universal Dependency Treebank for Odia Language

Shantipriya Parida, Kalyanamalini Sahoo, Atul Kr. Ojha, Saraswati Sahoo, Satya Ranjan Dash, Bijayalaxmi Dash

arXiv:2205.11976v1

Originality: Synthesis-oriented

AI Analysis

The paper addresses the lack of language resources for Odia and enables tools for cross-lingual learning and typological research, but it is incremental in that it applies existing methods to new data.

This paper presents the first publicly available Universal Dependency treebank for Odia, a low-resource Indian language, containing 100 manually annotated sentences (1082 tokens), and builds a preliminary parser whose highest reported score is 86.6% on tokenization.

This paper presents the first publicly available treebank of Odia, a morphologically rich, low-resource Indian language. The treebank contains approximately 1082 tokens (100 sentences) in Odia selected from "Samantar", the largest available collection of parallel corpora for Indic languages. All the selected sentences are manually annotated following the "Universal Dependency (UD)" guidelines. The morphological analysis of the Odia treebank was performed using machine learning techniques. The annotated Odia treebank will enrich Odia language resources and will help in building language technology tools for cross-lingual learning and typological research. We also build a preliminary Odia parser using a machine learning approach. The parser achieves 86.6% on tokenization, 64.1% UPOS, 63.78% XPOS, 42.04% UAS, and 21.34% LAS. Finally, the paper briefly discusses the linguistic analysis of the Odia UD treebank.
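
For readers unfamiliar with the last two metrics, UAS (unlabeled attachment score) is the percentage of tokens whose predicted head is correct, and LAS (labeled attachment score) additionally requires the dependency label to match. The following is a minimal sketch of that definition, not the evaluation code used in the paper. The function name `attachment_scores` and the toy sentence are hypothetical, and the sketch assumes the gold and predicted analyses share the same tokenization, which full evaluation tools do not assume.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency label).
# Head index 0 denotes the artificial root, as in CoNLL-U files.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over token-aligned analyses.

    Assumption: gold and pred contain the same tokens in the same order.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        raise ValueError("cannot score an empty sentence")

    # UAS counts tokens whose head matches, ignoring the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label both match.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))

    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * labeled_hits / n


# Hypothetical three-token sentence: every head is right, one label is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}%, LAS = {las:.2f}%")
```

Because LAS requires everything UAS requires plus a correct label, LAS can never exceed UAS, which is consistent with the reported 42.04% UAS and 21.34% LAS.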
