CL · Mar 31, 2022

A Baseline Readability Model for Cebuano

arXiv:2203.17225v3630 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper provides a foundational tool for readability assessment in Cebuano, though the contribution is incremental because it adapts existing methods from similar languages.

The authors developed the first baseline readability model for Cebuano, a Philippine language with about 27.5 million speakers. Using handcrafted linguistic features with an optimized Random Forest model, they obtained approximately 87% across all metrics.

In this study, we developed the first baseline readability model for the Cebuano language. Cebuano is the second most-used native language in the Philippines, with about 27.5 million speakers. As the baseline, we extracted traditional or surface-based features, syllable patterns based on Cebuano's documented orthography, and neural embeddings from the multilingual BERT model. Results show that the first two feature sets, both handcrafted linguistic features, obtained the best performance when used to train an optimized Random Forest model, with approximately 87% across all metrics. The feature sets and algorithm used are also similar to those in previous results on readability assessment for the Filipino language, showing the potential for crosslingual application. To encourage more work on readability assessment in Philippine languages such as Cebuano, we open-sourced both code and data.
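To make the two handcrafted feature families more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the specific surface features, and the five-letter vowel set are all assumptions made for illustration. In particular, `cv_skeleton` only maps letters to consonant/vowel symbols as a crude stand-in for syllable patterns; it does not syllabify words, and it treats the digraph "ng" as two consonants.

```python
import re
from collections import Counter

# Assumption: only the five plain Latin vowels count as vowels.
VOWELS = set("aeiou")


def tokenize(text):
    """Lowercase the text and return runs of letters as words."""
    return re.findall(r"[^\W\d_]+", text.lower())


def surface_features(text):
    """Count-based features of the kind usually called surface-based.

    The exact feature list here is hypothetical, not taken from the paper.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    n_words, n_sents = len(words), len(sentences)
    return {
        "word_count": n_words,
        "sentence_count": n_sents,
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sents if n_sents else 0.0,
    }


def cv_skeleton(word):
    """Map each letter to 'c' (consonant) or 'v' (vowel)."""
    return "".join(
        "v" if ch in VOWELS else "c" for ch in word.lower() if ch.isalpha()
    )


def cv_pattern_densities(text, n=2):
    """Occurrences of each length-n consonant/vowel pattern per word."""
    words = tokenize(text)
    counts = Counter()
    for word in words:
        skeleton = cv_skeleton(word)
        for i in range(len(skeleton) - n + 1):
            counts[skeleton[i:i + n]] += 1
    return {p: c / len(words) for p, c in counts.items()} if words else {}


if __name__ == "__main__":
    sample = "Maayong buntag. Kumusta ka?"
    print(surface_features(sample))
    print(cv_pattern_densities(sample))
```

Feature dictionaries like these could be turned into fixed-length vectors and passed to a tree ensemble such as scikit-learn's `RandomForestClassifier`, which is one plausible way to reproduce the kind of pipeline the abstract describes.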

Code Implementations: 1 repo
