CL, Dec 28, 2016

Shamela: A Large-Scale Historical Arabic Corpus

Yonatan Belinkov, Alexander Magidow, Maxim Romanov, Avi Shmidman, Moshe Koppel

arXiv:1612.08989v1, 24 citations

Originality: Synthesis-oriented

AI Analysis

The corpus is a valuable resource for researchers in digital humanities and linguistics who study historical Arabic, though the contribution is incremental, since it builds on existing corpus-creation methods.

The authors address the lack of a large-scale historical Arabic corpus by building Shamela, a corpus of about 1 billion words spanning diverse periods. They clean the corpus, process it with a morphological analyzer, detect parallel passages, and automatically date undated texts, then demonstrate its utility in digital humanities case studies.

Arabic is a widely-spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case-studies in which we show its application to the digital humanities.
