CL · Oct 17, 2025

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

Bolei Ma, Yina Yao, Anna-Carolina Haensch

arXiv:2510.15313v1 · 2.71 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation practices in creative AI tasks, particularly for culturally complex domains like classical poetry, though it is incremental in refining existing methods.

The study tackled the problem of evaluating large language models (LLMs) in generating classical Chinese poetry, specifically Tang poetry, by proposing a three-step evaluation framework. The results revealed systematic biases in both generation and evaluation, such as 'echo chamber' effects where LLMs converged on flawed standards that diverged from human judgments.

Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit "echo chamber" effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and the limitations of current LLMs as a proxy for literary generation, as well as the limits of current evaluation practices, and demonstrate the continued need for hybrid validation by both humans and models in culturally and technically complex creative tasks.
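As a rough illustration of what a computational metric for the "form" dimension could look like, the sketch below checks only the surface structure of a poem: whether it has 4 or 8 lines of uniformly 5 or 7 characters, as in quatrains (jueju) and regulated verse (lüshi). The function name `check_form` and the punctuation set are hypothetical and are not taken from the paper, whose actual metrics are not described in this abstract. Tonal patterns and rhyme are not checked.

```python
import re

# Hypothetical helper for illustration only; not the paper's metric.
# Split on common Chinese/ASCII punctuation and whitespace.
PUNCT = re.compile(r"[，。！？、；：,.!?;:\s]+")


def check_form(poem: str) -> dict:
    """Report line count, characters per line, and a coarse form label."""
    lines = [seg for seg in PUNCT.split(poem) if seg]
    lengths = {len(seg) for seg in lines}
    # A regular poem has exactly one distinct line length.
    chars = next(iter(lengths)) if len(lengths) == 1 else None

    form = None
    if chars in (5, 7) and len(lines) in (4, 8):
        # 4 lines: quatrain (jueju); 8 lines: regulated verse (lushi).
        form = "jueju" if len(lines) == 4 else "lushi"

    return {"lines": len(lines), "chars_per_line": chars, "form": form}


if __name__ == "__main__":
    # Li Bai's "Quiet Night Thought", a five-character quatrain.
    print(check_form("床前明月光，疑是地上霜。举头望明月，低头思故乡。"))
```

A check like this can only flag gross structural failures; judging themes, emotions, imagery, and style is where the abstract's LLM-as-a-judge and human expert steps come in.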
