CL | Feb 19, 2025

A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment

Khalid N. Elmadani, Nizar Habash, Hanada Taha-Thomure

arXiv:2502.13520v2 | 14.726 citations | h-index: 9 | ACL

Originality: Synthesis-oriented

AI Analysis

BAREC provides a comprehensive resource for researchers and educators working on Arabic text complexity, though the contribution is incremental because it focuses on dataset creation and benchmarking.

The paper tackles Arabic readability assessment by introducing BAREC, a large-scale, fine-grained dataset of 69,441 sentences covering 19 readability levels, with an average pairwise inter-annotator agreement of 81.8% (Quadratic Weighted Kappa). Benchmarks show competitive performance across various methods.

This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning over 1 million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, indicating a high level of agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we make BAREC openly available, along with detailed annotation guidelines and benchmark results.
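
The abstract reports agreement as an average pairwise Quadratic Weighted Kappa (QWK). As a rough illustration of what that metric measures, here is a minimal sketch of the standard QWK formula, averaged over annotator pairs. It assumes readability levels are coded as integers from 1 to 19; the function names and the toy labels are hypothetical and are not taken from the paper or its released code.

```python
from collections import Counter
from itertools import combinations


def quadratic_weighted_kappa(a, b, num_levels=19):
    """Standard QWK between two annotators' integer labels in 1..num_levels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    observed = Counter(zip(a, b))          # joint counts of (label_a, label_b)
    hist_a, hist_b = Counter(a), Counter(b)  # marginal counts per annotator
    max_dist = (num_levels - 1) ** 2
    disagreement = 0.0   # weighted observed disagreement
    expected = 0.0       # weighted disagreement expected by chance
    for i in range(1, num_levels + 1):
        for j in range(1, num_levels + 1):
            weight = (i - j) ** 2 / max_dist   # quadratic penalty for distance
            disagreement += weight * observed[(i, j)] / n
            expected += weight * hist_a[i] * hist_b[j] / (n * n)
    if expected == 0.0:
        # Both annotators used one identical level throughout.
        return 1.0
    return 1.0 - disagreement / expected


def mean_pairwise_qwk(annotations, num_levels=19):
    """Average QWK over all annotator pairs (needs at least two annotators)."""
    pairs = list(combinations(annotations, 2))
    scores = [quadratic_weighted_kappa(x, y, num_levels) for x, y in pairs]
    return sum(scores) / len(scores)


# Toy labels (made up for illustration): three annotators, four sentences.
toy = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 3, 4],
]
print(round(mean_pairwise_qwk(toy), 4))
```

Because the penalty grows with the squared distance between levels, a one-level disagreement on a 19-level scale costs far less than a disagreement of several levels, which makes QWK a natural fit for ordinal labels such as readability levels.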
