CL, Sep 11, 2018

Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

arXiv:1809.03891v1, 53 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the limited scope of historical language analysis available to Arabic linguists and researchers. Its contribution is incremental: it builds on existing periodizations, adding new data and methods.

The researchers tackled the lack of large-scale historical data for Arabic by building a corpus spanning 1400 years and analyzing it with NLP tools. The analysis confirms the established division of written Arabic into Modern Standard and Classical Arabic, along with other established periodizations, and suggests that further subdivisions may exist.

Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

Code Implementations: 1 repo