CL · Sep 28, 2019

Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data

Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Nada AlMarwani, Mohamed Al-Badrashiny

arXiv:1909.13009v1 · 31.01090 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for annotated Arabic code-switching data in natural language processing, though the contribution is incremental because it builds on existing annotation efforts.

The researchers tackled the problem of limited resources for linguistic code-switched Arabic data by creating a large multi-layered repository, resulting in 886,252 tokens tagged with 16 code-switching tags and achieving an inter-annotator agreement of 93.1%.

We present our effort to create a large multi-layered representational repository of linguistic code-switched Arabic data. The process involves developing clear annotation standards and guidelines, streamlining the annotation process, and implementing quality control measures. We used two main annotation protocols: in-lab gold annotation and crowdsourced annotation. We developed a web-based annotation tool to facilitate management of the annotation process. The current version of the repository contains a total of 886,252 tokens, each tagged with one of sixteen code-switching tags. The data exhibits code switching between Modern Standard Arabic and Egyptian Dialectal Arabic and represents three genres: tweets, commentaries, and discussion fora. The overall inter-annotator agreement is 93.1%.
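
The abstract reports an overall inter-annotator agreement of 93.1% but does not say how that figure was computed. As a rough illustration only, the sketch below assumes the simplest possible measure: the percentage of tokens on which two annotators chose the same tag. The function name `percent_agreement` and the tag labels (`msa`, `da`, `ne`) are hypothetical placeholders, not the paper's tag set or method.

```python
from typing import Sequence


def percent_agreement(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Return the percentage of token positions where two annotators agree.

    Assumption: both annotators labeled the same tokens in the same order,
    so the two sequences are aligned position by position.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must label the same number of tokens")
    if not tags_a:
        raise ValueError("cannot compute agreement over zero tokens")
    # Count the positions where the two tags are identical.
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return 100.0 * matches / len(tags_a)


# Hypothetical tags for a five-token sentence (illustrative labels only).
annotator_1 = ["msa", "da", "da", "ne", "msa"]
annotator_2 = ["msa", "da", "msa", "ne", "msa"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"{agreement:.1f}%")  # the annotators agree on 4 of 5 tokens
```

Raw percent agreement does not correct for agreement expected by chance. A chance-corrected statistic such as Cohen's kappa is a common alternative, and the abstract does not indicate which kind of measure the authors used.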
