CEScore: Simple and Efficient Confidence Estimation Model for Evaluating Split and Rephrase
CEScore provides a simple and effective metric for assessing SR models, addressing a key bottleneck in NLP evaluation.
The paper tackles the problem of automatically evaluating the output of split and rephrase (SR) systems, which rewrite a complex sentence as several shorter ones. It introduces CEScore, a statistical model that achieves a model-level Spearman correlation of 0.98 with human evaluations across 26 models.
The split and rephrase (SR) task aims to divide a long, complex sentence into a set of shorter, simpler sentences that convey the same meaning. This challenging NLP problem has recently gained attention because of its benefits as a pre-processing step for other NLP tasks. Evaluating the quality of SR output is difficult, as there is no automatic metric suited to this task. In this work, we introduce CEScore, a novel statistical model for automatically evaluating the SR task. By mimicking the way humans evaluate SR, CEScore provides four metrics (Sscore, Gscore, Mscore, and CEscore) to assess simplicity, grammaticality, meaning preservation, and overall quality, respectively. In experiments with 26 models, CEScore correlates strongly with human evaluations, achieving a Spearman correlation of 0.98 at the model level. This underscores the potential of CEScore as a simple and effective metric for assessing the overall quality of SR models.
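To make the "model-level" correlation concrete, the following is a minimal sketch of one common way such a figure is computed: average each model's per-sentence scores, then rank-correlate the per-model averages from the metric against those from human raters. This is an illustration under stated assumptions, not the authors' evaluation code; the function names, the averaging step, and the example scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def model_level_spearman(metric_scores, human_scores):
    """Correlate per-model mean metric scores with per-model mean human scores.

    Both arguments map a model name to a list of per-sentence scores.
    """
    models = sorted(metric_scores)
    metric_means = [mean(metric_scores[m]) for m in models]
    human_means = [mean(human_scores[m]) for m in models]
    return spearman(metric_means, human_means)


# Hypothetical scores for four made-up SR models (not data from the paper).
metric_scores_example = {
    "model_a": [0.9, 0.8],
    "model_b": [0.6, 0.7],
    "model_c": [0.4, 0.5],
    "model_d": [0.5, 0.7],
}
human_scores_example = {
    "model_a": [4.5, 4.0],
    "model_b": [3.0, 3.5],
    "model_c": [2.0, 2.5],
    "model_d": [3.5, 3.5],
}

if __name__ == "__main__":
    rho = model_level_spearman(metric_scores_example, human_scores_example)
    print(f"model-level Spearman: {rho:.2f}")
```

In this toy example the metric and the human raters agree on the best and worst model but swap the middle two, so the correlation falls below 1. A value near 1, such as the 0.98 reported above, means the metric orders the models almost exactly as the human raters do.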