A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
This work addresses meta-evaluation issues relevant to NLG researchers, offering improved interpretability and reduced annotation effort. However, it is incremental, since it builds on existing meta-evaluation approaches.
The paper tackles limitations in NLG meta-evaluation by proposing a dual-perspective framework for better interpretability and a method for automatically constructing benchmarks. It then analyzes 16 LLMs as evaluators to assess their performance under this framework.
In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with human judgments. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as problems in how human ratings are handled and ambiguous choices of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method for automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.
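To make the point about correlation measures concrete, the following is a minimal sketch of the traditional setup the abstract describes, in which a metric is judged by how well its scores correlate with human ratings. It is not the paper's proposed framework. The ratings and the two metrics (`metric_a`, `metric_b`) are invented toy values, and the helper functions are illustrative assumptions written in plain Python.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation: measures linear agreement between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data (hypothetical): human ratings for five outputs and two metrics' scores.
human = [1, 2, 3, 4, 5]
metric_a = [0.10, 0.11, 0.12, 0.13, 0.90]  # perfectly ordered, but far from linear
metric_b = [0.20, 0.10, 0.50, 0.60, 0.90]  # one swapped pair, but close to linear

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, "pearson:", round(pearson(human, scores), 3),
          "spearman:", round(spearman(human, scores), 3))
```

On this toy data, Spearman ranks `metric_a` above `metric_b`, while Pearson ranks them the other way around. The verdict on which metric is "better" therefore depends on the correlation measure chosen, which is one way the ambiguity mentioned in the abstract can arise.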