CL · Mar 14, 2025

OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

arXiv:2503.11858v3 · 6 citations · h-index: 10 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the need for reproducible, explainable evaluation metrics in NLG. It offers an incremental improvement over existing LLM-based methods by eliminating reliance on proprietary models.

The paper introduces OpeNLGauge, a fully open-source metric for evaluating natural language generation systems that explains its scores through error spans. It achieves competitive correlation with human judgments, outperforms state-of-the-art models on certain tasks, and produces explanations that are more than twice as accurate.

Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.
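
To make "correlation with human judgments" concrete, here is a minimal sketch of how a meta-evaluation can score a metric against human ratings. The abstract does not say which coefficient the authors report; Spearman rank correlation is assumed here purely for illustration, and the function names and toy data are hypothetical rather than taken from the paper or its code.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: one metric score and one human rating per output.
metric_scores = [0.91, 0.40, 0.75, 0.10, 0.66]
human_ratings = [5, 2, 4, 1, 4]
print(round(spearman(metric_scores, human_ratings), 3))
```

A coefficient near 1 means the metric orders outputs much as the human annotators do. In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written version.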

Code Implementations: 1 repo
