BatchEval: Towards Human-like Text Evaluation
This work targets more reliable and cost-effective text evaluation for researchers and practitioners in natural language processing. It offers a novel paradigm, although its improvements over existing LLM-based methods are incremental.
The paper addresses automatic text evaluation, where current methods are sensitive to prompt design and resist noise poorly. It proposes BatchEval, a batch-wise evaluation paradigm that improves Pearson correlation by 10.5% over state-of-the-art methods at 64% of the API cost on average.
Significant progress has been made in automatic text evaluation with the introduction of large language models (LLMs) as evaluators. However, the current sample-wise evaluation paradigm suffers from the following issues: (1) sensitivity to prompt design; (2) poor resistance to noise; (3) inferior ensemble performance with a static reference. Inspired by the fact that humans treat both criterion definition and inter-sample comparison as references for evaluation, we propose BatchEval, a paradigm that conducts batch-wise evaluation iteratively to alleviate the above problems. We explore variants under this paradigm and confirm that the optimal settings are a two-stage procedure with a heterogeneous batch composition strategy and a decimal scoring format. Comprehensive experiments across 3 LLMs on 4 text evaluation tasks demonstrate that BatchEval outperforms state-of-the-art methods by 10.5% on Pearson correlation with only 64% of the API cost on average. Further analyses verify the robustness, generalization, and working mechanism of BatchEval.
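The abstract does not spell out how heterogeneous batches are composed or how the Pearson correlation is computed, so the following is only a minimal sketch under stated assumptions. It assumes each sample already carries a provisional score from an earlier round and that "heterogeneous" means each batch should span a wide range of those scores. The names `compose_heterogeneous_batches` and `pearson` are hypothetical and are not taken from the paper.

```python
from math import sqrt
from typing import List, Sequence


def compose_heterogeneous_batches(scores: Sequence[float], batch_size: int) -> List[List[int]]:
    """Group sample indices so each batch mixes low- and high-scoring samples.

    Assumption (not from the paper): heterogeneity is approximated by ranking
    samples on their provisional scores and dealing them out round-robin.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Rank sample indices by provisional score (sorted() is stable for ties).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    n_batches = -(-len(ranked) // batch_size)  # ceiling division
    batches: List[List[int]] = [[] for _ in range(n_batches)]
    # Neighbours in the ranking land in different batches.
    for position, idx in enumerate(ranked):
        batches[position % n_batches].append(idx)
    return batches


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted scores and human ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Illustrative decimal scores, invented for this example only.
provisional = [1.0, 1.5, 3.0, 3.2, 4.8, 5.0]
batches = compose_heterogeneous_batches(provisional, batch_size=3)
```

In an iterative setup, the scores returned for one round of batches could serve as the provisional scores used to compose the next round, with `pearson` measuring agreement with human ratings after each round. That loop is likewise an assumption about how the pieces fit together rather than a description of the authors' implementation.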