BERT Embeddings for Automatic Readability Assessment
This work addresses readability assessment for low-resource languages such as Filipino, where NLP tools are limited. The contribution is incremental, since it builds on existing BERT and feature-based methods.
The study tackles automatic readability assessment by combining BERT embeddings with handcrafted linguistic features. The resulting method outperforms classical approaches on English and Filipino datasets, with up to a 12.4% increase in F1 performance.
Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. One of the many open problems in the field is making models trained for the task effective even for low-resource languages. In this study, we propose an alternative way of utilizing the information-rich embeddings of BERT models together with handcrafted linguistic features through a combined method for readability assessment. Results show that the proposed method outperforms classical approaches in readability assessment on English and Filipino datasets, obtaining as high as a 12.4% increase in F1 performance. We also show that the general information encoded in BERT embeddings can be used as a substitute feature set for low-resource languages like Filipino, which have limited semantic and syntactic NLP tools for explicitly extracting feature values for the task.
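As a rough illustration of what a combined feature set could look like, the sketch below joins a document embedding with a few handcrafted features into one input vector for a classical classifier. It rests on several assumptions that the abstract does not confirm: that "combined" means simple concatenation, that the BERT embedding has already been computed elsewhere (a short placeholder list stands in for it here), and that the three surface features shown are representative. The function names `handcrafted_features` and `combine` are hypothetical and are not taken from the study.

```python
import re
from typing import List, Sequence


def handcrafted_features(text: str) -> List[float]:
    """Compute three illustrative surface features.

    These are assumptions for the sketch, not the study's feature set:
    average sentence length (in words), average word length (in
    characters), and type-token ratio.
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of Unicode letters only, so non-English text is handled too.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words or not sentences:
        return [0.0, 0.0, 0.0]
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    type_token_ratio = len(set(words)) / len(words)
    return [avg_sentence_len, avg_word_len, type_token_ratio]


def combine(embedding: Sequence[float], features: Sequence[float]) -> List[float]:
    """Concatenate an embedding with handcrafted features (assumed strategy)."""
    return list(embedding) + list(features)


text = "The cat sat. The cat ran fast."
# Placeholder for a precomputed BERT document embedding; a real one
# would typically have several hundred dimensions.
embedding = [0.1, -0.2, 0.3, 0.0]
vector = combine(embedding, handcrafted_features(text))
print(len(vector))
```

Under these assumptions, the low-resource case described above would correspond to passing the embedding alone to the classifier when no tools exist to extract the handcrafted features.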