Reference-less Quality Estimation of Text Simplification Systems
This work addresses the problem of evaluating text simplification, which matters to both researchers and practitioners. Its contribution is incremental, however, since it adapts existing metrics rather than introducing a new paradigm.
The paper tackles the challenge of evaluating text simplification systems without reference data. It shows that n-gram-based metrics such as BLEU and METEOR correlate best with human judgments of grammaticality and meaning preservation, while simplicity is best evaluated by length-based metrics.
The evaluation of text simplification (TS) systems remains an open challenge. As the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU. However, such metrics require high-quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between the simplified text and its original version. In this paper, we compare multiple approaches to reference-less quality estimation of sentence-level text simplification systems, based on the dataset used for the QATS 2016 shared task. We distinguish three different dimensions: grammaticality, meaning preservation and simplicity. We show that n-gram-based MT metrics such as BLEU and METEOR correlate the most with human judgment of grammaticality and meaning preservation, whereas simplicity is best evaluated by basic length-based metrics.
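The reference-less idea in the abstract, comparing a simplified sentence directly with its original instead of with a human reference, can be illustrated with a minimal sketch. This is not the paper's implementation. The helper names `ngram_precision` and `length_ratio` are hypothetical, whitespace tokenization is an assumption, and the clipped n-gram precision below is only a simplified stand-in for full BLEU or METEOR.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(source, output, n=2):
    """Clipped n-gram precision of the output against the SOURCE sentence.

    A simplified stand-in for BLEU-style overlap: the original sentence
    plays the role usually played by a human reference.
    """
    src_counts = Counter(ngrams(source.lower().split(), n))
    out_counts = Counter(ngrams(output.lower().split(), n))
    total = sum(out_counts.values())
    if total == 0:
        return 0.0  # output shorter than n tokens
    # Clip each output n-gram count by its count in the source.
    matched = sum(min(c, src_counts[g]) for g, c in out_counts.items())
    return matched / total


def length_ratio(source, output):
    """Output length over source length, in whitespace tokens."""
    src_len = len(source.split())
    return len(output.split()) / src_len if src_len else 0.0


if __name__ == "__main__":
    src = "the cat sat on the mat"
    out = "the cat sat"
    print(ngram_precision(src, out), length_ratio(src, out))
```

Under these assumptions, a high overlap with the source would serve as a rough proxy for grammaticality and meaning preservation, and a lower length ratio as a rough proxy for simplicity. How well such proxies track human judgment is the question the paper studies.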