Critical Survey of the Freely Available Arabic Corpora
This survey addresses the need for accessible language resources in the Arabic NLP community, but its contribution is incremental: it compiles existing data rather than introducing new methods or corpora.
The authors address the limited access to freely available Arabic corpora for NLP research by conducting a survey, which produced an initial list of 66 sources, with direct links provided where possible.
The availability of corpora is a major factor in building natural language processing applications. However, the cost of acquiring corpora can prevent some researchers from pursuing their work. Easy access to freely available corpora is urgently needed in the NLP research community, especially for languages such as Arabic. Currently, there is no easy way to access a comprehensive and up-to-date list of freely available Arabic corpora. In this paper, we present the results of a recent survey conducted to identify the freely available Arabic corpora and language resources. Our preliminary results yielded an initial list of 66 sources. We present our findings for the various categories studied and provide direct links to the data where possible.