CL · Mar 10

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

arXiv:2603.09403v1 · 76.8 · h-index: 10
Predicted impact top 79% in CL · last 90 days
Originality: Incremental advance
AI Analysis

The paper addresses the scalability and cost of validating NLP evaluation metrics, particularly for non-English datasets, though it is an incremental improvement on existing synthetic data methods.

The paper tackles the problem of validating NLP evaluation metrics, which typically requires expensive human annotations, by proposing a framework that uses LLMs to generate synthetic evaluation datasets through controlled semantic degradation of real data. The results show that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA.

Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which exist predominantly for English datasets. We propose LLM as a Meta-Judge, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.
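The abstract describes meta-correlation only as the alignment between metric rankings derived from synthetic data and those derived from human benchmarks. A minimal sketch of one way to compute such a score follows, assuming each metric has already been given one quality number per reference (for example, its correlation with the synthetic degradation levels and its correlation with human scores) and that alignment is measured with Spearman rank correlation. The function names, the choice of Spearman, and the sample numbers are illustrative assumptions, not the paper's stated procedure.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def meta_correlation(synthetic_quality, human_quality):
    """Agreement between two metric rankings (assumed definition).

    Both arguments map a metric name to a quality score, one computed
    against synthetic data and one against human judgments.
    """
    names = sorted(synthetic_quality)
    return spearman(
        [synthetic_quality[n] for n in names],
        [human_quality[n] for n in names],
    )


# Made-up quality scores for three hypothetical metrics.
synthetic = {"metric_a": 0.81, "metric_b": 0.55, "metric_c": 0.67}
human = {"metric_a": 0.74, "metric_b": 0.40, "metric_c": 0.62}
score = meta_correlation(synthetic, human)
```

Under this assumed definition, a score near 1 means the synthetic data orders the metrics the same way the human benchmark does, which is the sense in which synthetic validation would act as a proxy for human judgment.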

