CL · AI · LG · Sep 5, 2025

Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too

arXiv:2509.05440v1 · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the need for absolute scoring in NLG evaluation for thresholding applications, offering an incremental improvement over existing pairwise methods.

The paper tackles automatic evaluation for natural language generation by proposing a direct-scoring method that uses synthetic summaries to enable pairwise comparisons at test time. It performs comparably to state-of-the-art pairwise evaluators, with axis-averaged sample-level correlation differences of +0.03 on SummEval, -0.03 on TopicalChat, and +0.05 on HANNA.
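The "axis-averaged sample-level correlation" metric can be illustrated with a short sketch. It assumes the common meta-evaluation convention: for each source input, correlate evaluator scores with human scores across that input's candidate outputs, average over inputs, then average over quality axes. The choice of Spearman correlation and all function names here are assumptions for illustration, not taken from the paper.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns None if either side is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation computed as Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def sample_level_correlation(groups):
    """groups: one (evaluator_scores, human_scores) pair per source input."""
    per_input = [spearman(e, h) for e, h in groups]
    # Inputs where the correlation is undefined are skipped (an assumption).
    return mean(r for r in per_input if r is not None)

def axis_averaged(groups_by_axis):
    """groups_by_axis: dict mapping an axis name to its list of groups."""
    return mean(sample_level_correlation(g) for g in groups_by_axis.values())

# Toy data: one input where the evaluator agrees with humans, one where it is reversed.
toy = {"coherence": [([1, 2, 3], [1, 2, 3]), ([1, 2, 3], [3, 2, 1])]}
result = axis_averaged(toy)
```

Under this convention, a reported value such as +0.03 is a difference between two such averaged correlations.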

As large language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For sample-level performance, methods which operate by using pairwise comparisons between machine-generated text perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.
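One way to picture how pairwise comparisons against fixed synthetic summaries could yield an absolute, thresholdable score is sketched below. This is an illustrative assumption and not the paper's procedure, which the abstract does not spell out. The names `direct_score`, `toy_prefer_prob`, and `passes_threshold` are hypothetical, and the toy judge (which simply prefers the text with more words) stands in for an LLM-based pairwise rater.

```python
from typing import Callable, Sequence

def direct_score(
    candidate: str,
    anchors: Sequence[str],
    prefer_prob: Callable[[str, str], float],
) -> float:
    """Expected number of anchor texts the candidate beats, in [0, len(anchors)].

    Because the anchors are fixed, scores for different candidates are
    comparable with each other and can be compared to a threshold.
    """
    return sum(prefer_prob(candidate, anchor) for anchor in anchors)

def toy_prefer_prob(candidate: str, anchor: str) -> float:
    """Hypothetical stand-in for an LLM judge: prefers the text with more words."""
    c, a = len(candidate.split()), len(anchor.split())
    if c == a:
        return 0.5
    return 1.0 if c > a else 0.0

def passes_threshold(score: float, threshold: float) -> bool:
    """Thresholding, the use case that needs absolute scores."""
    return score >= threshold

# Toy anchors standing in for synthetic summaries of increasing quality.
anchors = ["a", "a b c", "a b c d e"]
score = direct_score("a b c d", anchors, toy_prefer_prob)
accepted = passes_threshold(score, threshold=1.5)
```

The design point is that a pairwise judge only needs to answer "which of these two is better", while the fixed anchor set supplies the common scale that a purely pairwise ranking of machine outputs lacks.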
