CL, Oct 5, 2014

Corpora Preparation and Stopword List Generation for Arabic data in Social Network

Walaa Medhat, Ahmed H. Yousef, Hoda Korashy

arXiv:1410.1135v1, 7 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better natural language processing tools for Arabic dialects in social media, though it is incremental as it builds on existing methods for stopword removal.

The paper tackles sentiment analysis of Arabic social network data by preparing corpora and generating stopword lists for the Egyptian dialect. It reports that lists containing Egyptian dialect words gave better text classification performance than Modern Standard Arabic lists alone.

This paper proposes a methodology for preparing Arabic-language corpora from online social networks (OSN) and review sites for the Sentiment Analysis (SA) task. It also proposes a methodology for generating a stopword list from the prepared corpora. The aim is to investigate the effect of removing stopwords on the SA task. The problem is that previously generated stopword lists cover Modern Standard Arabic (MSA), which is not the language commonly used in OSN. We have generated a stopword list for the Egyptian dialect and a corpus-based list to be used with the OSN corpora. We compare the efficiency of text classification when using the generated lists, previously generated MSA lists, and a combination of the Egyptian dialect list with the MSA list. Text classification was performed using Naïve Bayes and Decision Tree classifiers with two feature selection approaches, unigrams and bigrams. The experiments show that the general lists containing Egyptian dialect words give better performance than lists of MSA stopwords only.
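The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the corpus-based list is derived, so selecting the most frequent tokens is an assumption here, not the authors' method. The function names, the transliterated toy documents, and the tiny `msa_list` and `egyptian_list` sets are all hypothetical placeholders.

```python
from collections import Counter
from typing import Iterable, List, Set


def tokenize(text: str) -> List[str]:
    # Whitespace tokenization only; real Arabic text would also need normalization.
    return text.split()


def corpus_stopwords(docs: Iterable[str], top_k: int) -> Set[str]:
    # Assumption: treat the top_k most frequent tokens as corpus-based stopwords.
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return {tok for tok, _ in counts.most_common(top_k)}


def remove_stopwords(tokens: List[str], stopwords: Set[str]) -> List[str]:
    # Keep only the tokens that are not in the stopword list.
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens: List[str], n: int) -> List[str]:
    # n=1 yields unigram features, n=2 yields bigram features.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical transliterated toy reviews, for illustration only.
docs = [
    "el film da gamed awy",
    "el akl da mesh helw",
    "el makan da helw awy",
]

# Hypothetical placeholder lists standing in for the MSA and Egyptian dialect lists.
msa_list = {"fi", "min"}
egyptian_list = {"da", "mesh"}

corpus_list = corpus_stopwords(docs, top_k=2)
combined_list = msa_list | egyptian_list | corpus_list

filtered = remove_stopwords(tokenize(docs[0]), combined_list)
unigram_features = ngrams(filtered, 1)
bigram_features = ngrams(filtered, 2)
```

The resulting unigram or bigram features could then be passed to a classifier such as Naïve Bayes or a Decision Tree, and the comparison repeated with each stopword list in turn.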
