CL · AI · May 10

Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification

Kenji Hilasaca, Nouran Khallaf, Serge Sharoff

arXiv:2605.09476 · 77.1

AI Analysis

For researchers in multilingual text simplification, this paper provides a new resource for training and evaluating models, though the approach is incremental.

This work addresses the scarcity of high-quality multilingual text simplification datasets by constructing a sentence-aligned corpus from crowd-sourced comparable data for five languages (Catalan, English, French, Italian, Spanish). The resulting dataset is publicly released.

Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on collecting and processing crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems in five languages (Catalan, English, French, Italian and Spanish). We describe mechanisms for deriving sentence-level alignments from document-level data. The resulting dataset of aligned sentence pairs is publicly available.
