CL · Jun 5, 2025

Identifying Reliable Evaluation Metrics for Scientific Text Revision

arXiv:2506.04772v3 · 5 citations · h-index: 5 · ACL
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of reliably evaluating scientific text revision. Its contribution is incremental, as it builds on existing methods.

The paper tackles the problem of evaluating text revision in scientific writing by analyzing the limitations of traditional metrics and exploring alternatives. It finds that a hybrid approach combining LLM-as-a-judge evaluation with task-specific metrics offers the most reliable assessment.

Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.
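The claim that similarity metrics reward closeness to a reference rather than genuine improvement can be illustrated with a small sketch. The code below is not the paper's evaluation code and is not the official ROUGE implementation. It is a simplified unigram-overlap F1 in the spirit of ROUGE-1, and the function name `unigram_f1` and the three example sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: an unrevised draft that still contains grammatical errors,
# a gold revision, and an alternative revision that is fluent but worded differently.
source = "the results shows that our method are better than baseline"
gold = "the results show that our method outperforms the baseline"
good_revision = "our approach performs better than the baseline as the results show"

score_unrevised = unigram_f1(source, gold)
score_good = unigram_f1(good_revision, gold)

print(f"unrevised draft vs gold: {score_unrevised:.3f}")
print(f"valid paraphrase vs gold: {score_good:.3f}")
```

In this toy case the uncorrected draft scores higher against the gold reference than the fluent paraphrase, simply because it shares more surface tokens with it. This is the kind of behaviour that motivates reference-free metrics and LLM-as-a-judge evaluation.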

Code Implementations: 1 repo