CL, Apr 3, 2023

Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study

Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu

arXiv:2304.00723v3, 26.0, 180 citations, h-index: 54

Originality: Incremental advance

AI Analysis

This work addresses the challenge of text quality evaluation for NLP researchers and practitioners, but it is incremental as it explores optimization of existing LLMs rather than introducing a new paradigm.

This paper tackles the problem of evaluating text quality in NLP by investigating how well large language models such as ChatGPT perform at reference-free evaluation. It finds that ChatGPT outperforms most existing automatic metrics and that the Explicit Score method is the most effective and reliable of the methods compared.

Evaluating the quality of generated text is a challenging task in NLP, due to the inherent complexity and diversity of text. Recently, large language models (LLMs) have garnered significant attention due to their impressive performance in various tasks. Therefore, we present this paper to investigate the effectiveness of LLMs, especially ChatGPT, and explore ways to optimize their use in assessing text quality. We compared three kinds of reference-free evaluation methods. The experimental results show that ChatGPT can evaluate text quality effectively from various perspectives without a reference, and that it outperforms most existing automatic metrics. In particular, the Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored. However, directly comparing the quality of two texts may lead to suboptimal results. We believe this paper will provide valuable insights for evaluating text quality with LLMs, and we have released the data used.
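To make the two ideas named in the abstract concrete, here is a minimal Python sketch of an Explicit Score call (ask the model for a number, then parse it) next to a direct pairwise comparison. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the 1 to 5 scale, the `aspect` parameter, and the `llm` callable (any function that maps a prompt string to a reply string) are all hypothetical and do not come from the paper.

```python
import re
from typing import Callable, Optional

# Hypothetical type: any function that sends a prompt to a model and returns its reply.
LLM = Callable[[str], str]


def build_explicit_score_prompt(text: str, aspect: str = "overall quality",
                                low: int = 1, high: int = 5) -> str:
    """Build a reference-free scoring prompt (wording is an assumption, not the paper's)."""
    return (
        f"Score the following text for {aspect} on a scale from {low} to {high}. "
        f"Reply with the number only.\n\nText:\n{text}\n\nScore:"
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[float]:
    """Extract the first number in the reply; return None if absent or out of range."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if low <= value <= high else None


def explicit_score(text: str, llm: LLM, aspect: str = "overall quality",
                   low: int = 1, high: int = 5) -> Optional[float]:
    """Explicit-score style evaluation: the model generates a numeric score directly."""
    prompt = build_explicit_score_prompt(text, aspect, low, high)
    return parse_score(llm(prompt), low, high)


def pairwise_compare(text_a: str, text_b: str, llm: LLM,
                     aspect: str = "overall quality") -> Optional[str]:
    """Direct comparison of two texts; returns "A", "B", or None if the reply is unclear."""
    prompt = (
        f"Which text is better in terms of {aspect}? Reply with A or B only.\n\n"
        f"Text A:\n{text_a}\n\nText B:\n{text_b}\n\nAnswer:"
    )
    first = llm(prompt).strip().upper()[:1]
    return first if first in ("A", "B") else None
```

In use, `llm` would wrap whatever chat-model client is available; for a dry run, a stub such as `lambda prompt: "4"` is enough to exercise the parsing. Returning `None` for replies that cannot be parsed keeps malformed outputs from being silently counted as scores.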

