CL, AI (Oct 12, 2022)

Perplexity from PLM Is Unreliable for Evaluating Text Quality

Yequan Wang, Jiawen Deng, Aixin Sun, Xuying Meng

arXiv:2210.05892v25.29 citations, h-index: 63

Originality: Synthesis-oriented

AI Analysis

This paper addresses an issue that matters to NLP researchers and practitioners who rely on PPL for text evaluation. Its critique of PPL's limitations is incremental but important.

The paper tackles the problem of using perplexity (PPL) from pre-trained language models to evaluate text quality, finding it unreliable due to issues like length bias and punctuation sensitivity, with experiments demonstrating these flaws.

Recently, many works have used perplexity (PPL) to evaluate the quality of generated text. They assume that the smaller the PPL, the better the quality (i.e., fluency) of the evaluated text. However, we find that PPL is an unqualified referee that cannot evaluate generated text fairly, for the following reasons: (i) the PPL of short text is larger than that of long text, which goes against common sense; (ii) repeated text spans can damage the performance of PPL; and (iii) punctuation marks can heavily affect the performance of PPL. Experiments show that PPL is unreliable for evaluating the quality of a given text. Last, we discuss the key problems with evaluating text quality using language models.
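For reference, perplexity is commonly defined as the exponential of the average negative log-likelihood per token. The sketch below is not taken from the paper; it assumes per-token natural-log probabilities have already been obtained from some language model, and the function name `perplexity` and the toy input are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp of the mean negative log-probability per token.

    Assumes natural-log probabilities, one per token. This is an
    illustrative helper, not the paper's code.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: four tokens, each assigned probability 0.25 by a hypothetical model.
toy_log_probs = [math.log(0.25)] * 4
toy_ppl = perplexity(toy_log_probs)  # approximately 4.0
```

Because the sum is divided by the token count, a single low-probability token shifts the score more in a short sequence than in a long one, which is one way to see why sequence length can matter for this metric.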
