IR CL MA Feb 8, 2021

An Enhanced Corpus for Arabic Newspapers Comments

Hichem Rahab, Abdelhafid Zitouni, Mahieddine Djoudi

arXiv:2102.09965v1 (1 citation)

Originality Synthesis-oriented

AI Analysis

This work addresses the lack of a dedicated corpus for Algerian Arabic newspaper comments, which is a problem for researchers working on sentiment analysis in this specific dialect.

The paper develops an enhanced corpus of Algerian Arabic newspaper comments by collecting comments from three Algerian newspapers and annotating them for sentiment. The authors used SVM, Naive Bayes, and k-NN classifiers to categorize comments into positive and negative classes, and found that stemming did not significantly improve classification accuracy, which they attribute to the dialectal nature of the comments.

In this paper, we propose an enhanced approach to creating a dedicated corpus for Algerian Arabic newspaper comments. The approach enhances an existing one by enriching the available corpus and adding an annotation step that follows the Model Annotate Train Test Evaluate Revise (MATTER) approach. The corpus is created by collecting comments from the websites of three well-known Algerian newspapers. Three classifiers, support vector machines, naïve Bayes, and k-nearest neighbors, were used to classify comments into positive and negative classes. To identify the influence of stemming on the results, the classification was tested with and without stemming. The results show that stemming does not considerably enhance classification, owing to the nature of Algerian comments, which are tied to the Algerian Arabic dialect. These promising results motivate us to improve our approach, especially in dealing with non-Arabic sentences, particularly dialectal and French ones.
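
The with/without-stemming comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it implements only a small naïve Bayes classifier in pure Python (the SVM and k-NN classifiers are omitted), the `light_stem` function is a hypothetical English suffix stripper standing in for a real Arabic stemmer, and the `TRAIN` and `TEST` comments are invented toy examples, not data from the corpus.

```python
import math
from collections import Counter, defaultdict


def light_stem(token):
    """Toy suffix stripper; a hypothetical stand-in for a real Arabic stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text, stem=False):
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [light_stem(t) for t in tokens] if stem else tokens


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(train, test, stem):
    """Train on `train`, then return the fraction of `test` classified correctly."""
    docs = [tokenize(text, stem) for text, _ in train]
    labels = [label for _, label in train]
    model = NaiveBayes().fit(docs, labels)
    hits = sum(model.predict(tokenize(text, stem)) == label for text, label in test)
    return hits / len(test)


# Invented toy comments for illustration only (not from the paper's corpus).
TRAIN = [
    ("great article thanks", "positive"),
    ("excellent reporting well done", "positive"),
    ("good news for everyone", "positive"),
    ("terrible decision shame", "negative"),
    ("bad news again", "negative"),
    ("awful reporting very bad", "negative"),
]
TEST = [
    ("great news", "positive"),
    ("very bad decision", "negative"),
]

for stem in (False, True):
    print(f"stemming={stem}: accuracy={accuracy(TRAIN, TEST, stem):.2f}")
```

Running the same train/test split twice, once with `stem=False` and once with `stem=True`, mirrors the experimental design the abstract describes: any difference in accuracy can then be attributed to the stemming step alone. On a toy set this small the comparison says nothing about real dialectal text.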
