CL, Apr 29, 2024

The SAMER Arabic Text Simplification Corpus

arXiv:2404.18615v1, 87 citations, h-index: 8, LREC
Originality: Synthesis-oriented
AI Analysis

This paper addresses the scarcity of resources for Arabic text simplification and readability assessment, particularly for school-aged learners. The contribution is incremental: it applies existing corpus creation methods to a new language domain.

The authors address the lack of manually annotated Arabic parallel corpora for text simplification by creating the SAMER Corpus. It comprises 159K words from fiction novels, with readability annotations and two simplified versions of each text for different learner levels. The corpus is publicly available to support research.

We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels, most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.

